Abstract: We want to recover the regression function in the single-index
model. Using an aggregation algorithm with local polynomial estimators,
we answer in particular the second part of Question 2 from Stone (1982)
on the optimal convergence rate. The procedure constructed here has strong
adaptation properties: it adapts both to the smoothness of the link function
and to the unknown index. Moreover, the procedure locally adapts to the
distribution of the design. We propose new upper bounds for the local
polynomial estimator (which are results of independent interest) that allow
a fairly general design. The behavior of this algorithm is studied through
numerical simulations. In particular, we show empirically that it improves
strongly over empirical risk minimization.
1. Introduction

The single-index model is standard in the statistical literature. It is widely
used in several fields, since it provides a simple trade-off between purely
nonparametric and purely parametric approaches. Moreover, it is well known that
it allows one to deal with the so-called "curse of dimensionality" phenomenon.
Within the minimax theory, this phenomenon is explained by the fact that the
minimax rate linked to this model (which is multivariate, in the sense that the
number of explanatory variables is larger than 1) is the same as in the
univariate model. Indeed, if $n$ is the sample size, the minimax rate over an
isotropic $s$-Hölder ball is $n^{-2s/(2s+d)}$ for the mean integrated squared
error (MISE) in the $d$-dimensional regression model without the single-index
constraint, while in the single-index model, this rate is conjectured to be
$n^{-2s/(2s+1)}$ by Stone (1982). Hence, even for small values of $d$ (larger
than 2), the dimension has a strong impact on the quality of estimation when no
prior assumption on the structure of the multivariate regression function is
made. In this sense, the single-index model provides a simple way to reduce the
dimension of the problem.
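To make the gap between the two rates concrete, the following minimal Python sketch evaluates $n^{-2s/(2s+d)}$ and $n^{-2s/(2s+1)}$ side by side. The function name `minimax_rate` and the values of `n` and `s` are illustrative assumptions, not taken from the study.

```python
def minimax_rate(n, s, dim):
    """Rate n^(-2s/(2s+dim)) for the mean integrated squared error."""
    return n ** (-2.0 * s / (2.0 * s + dim))


n, s = 1000, 2.0  # illustrative sample size and smoothness (assumptions)
for d in (2, 3, 5, 10):
    full = minimax_rate(n, s, d)    # d-dimensional model, no index structure
    single = minimax_rate(n, s, 1)  # rate conjectured for the single-index model
    print(f"d={d:2d}  full={full:.4f}  single-index={single:.4f}")
```

For fixed $n$ and $s$, the first column grows with $d$ while the second does not depend on $d$ at all.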
Let $(X, Y) \in \mathbb{R}^d \times \mathbb{R}$ be a random variable satisfying

$$Y = g(X) + \sigma(X)\varepsilon, \tag{1.1}$$

where $\varepsilon$ is independent of $X$ with law $N(0, 1)$ and where
$\sigma(\cdot)$ is such that $\sigma_0 \le \sigma(X) \le \sigma_1$ a.s. for some
$\sigma_0 > 0$ and a known $\sigma_1 > 0$. We denote by $P$ the probability
distribution of $(X, Y)$ and by $P_X$ the marginal law of $X$, or design law.
In the single-index model, the regression function has a particular structure.
Indeed, we assume that $g$ can be written as

$$g(x) = f(\vartheta^\top x) \tag{1.2}$$

for all $x \in \mathbb{R}^d$, where $f : \mathbb{R} \to \mathbb{R}$ is the link
function and where $\vartheta \in \mathbb{R}^d$ is the direction, or index. In
order to make the representation (1.2) unique (identifiability), we assume the
following (see for instance the survey paper by Geenens and Delecroix (2005),
or Chapter 2 in Horowitz (1998)):



• $f$ is not constant over the support of $\vartheta^\top X$;

• $X$ admits at least one continuously distributed coordinate (w.r.t. the
Lebesgue measure);

• the support of $X$ is not contained in any proper linear subspace of
$\mathbb{R}^d$;

• $\vartheta \in S_+^{d-1}$, where $S_+^{d-1}$ is the half unit sphere defined by

$$S_+^{d-1} = \{ v \in \mathbb{R}^d \mid \|v\|_2 = 1 \text{ and } v_d > 0 \}, \tag{1.3}$$

where $\|\cdot\|_2$ is the Euclidean norm over $\mathbb{R}^d$.
We assume that the available data

$$D_n := [(X_i, Y_i); 1 \le i \le n] \tag{1.4}$$

is a sample of $n$ i.i.d. copies of $(X, Y)$ satisfying (1.1) and (1.2). In this
model, we can focus on the estimation of the index $\vartheta$ based on $D_n$
when the link function $f$ is unknown, or we can focus on the estimation of the
regression $g$ when both $f$ and $\vartheta$ are unknown. In this paper, we
consider the latter problem. It is assumed below that $f$ belongs to some family
of Hölder balls, that is, we do not suppose its smoothness to be known.

The statistical literature on this model is wide. Among many other references,
see Horowitz (1998) for applications in econometrics; an application in medical
science can be found in Xia and Härdle (2006); see also Delecroix et al. (2003),
Delecroix et al. (2006) and the survey paper by Geenens and Delecroix (2005).
For the estimation of the index, see for instance Hristache et al. (2001); for
testing the parametric versus the nonparametric single-index assumption, see
Stute and Zhu (2005). See also a chapter in Györfi et al. (2002) which is
devoted to dimension reduction techniques in the bounded regression model. While
the literature on single-index modelling is vast, several problems remain open.
For instance, the second part of Question 2 from Stone (1982) concerning the
minimax rate over Hölder balls in model (1.1), (1.2) is still open. The first
part, concerning additive modelling, is handled in Yang (2000a) and Yang and
Barron (1999).

This paper provides new minimax results about the single-index model, which
provide an answer, in particular, to the latter question. Indeed, we prove that
in model (1.1), (1.2) we can achieve the rate $n^{-2s/(2s+1)}$ for a link
function in a whole family of Hölder balls with smoothness $s$, see Theorem 1.
The optimality of this rate is proved in Theorem 2. To prove the upper bound, we
use an estimator which adapts both to the index parameter and to the smoothness
of the link function. This result is stated under fairly general assumptions on
the design, which include any "non-pathological" law for $P_X$. Moreover, this
estimator has a nice "design-adaptation" property, since its construction does
not depend on $P_X$.



2. Construction of the procedure 

The procedure developed here for recovering the regression does not use a
plug-in estimator based on a direct estimation of the index. Instead, it adapts
to it, by aggregating several univariate estimators based on projected samples

$$D_m(v) := [(v^\top X_i, Y_i); 1 \le i \le m], \tag{2.1}$$

where $m < n$, for several $v$ in a regular lattice of $S_+^{d-1}$. This
"adaptation to the direction" uses a split of the sample. We split the whole
sample $D_n$ into a training sample

$$D_m := [(X_i, Y_i); 1 \le i \le m]$$

and a learning sample

$$D_{(m)} := [(X_i, Y_i); m + 1 \le i \le n].$$

The choice of the split size can be quite general (see Section 3 for details).
In the numerical study (conducted in Section 4 below), we simply take
$m = 3n/4$ (the learning sample size is a quarter of the whole sample), which
provides good results, but other splits can be considered as well.
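The split and the projection (2.1) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names `split_sample` and `project`, the toy data and the choice of keeping the first $m$ points for training are hypothetical and not the authors' implementation (Section 4 draws the learning sample at random).

```python
import math


def split_sample(data, train_fraction=0.75):
    """Split [(x_i, y_i)] into the first m points and the remaining n - m."""
    m = math.floor(train_fraction * len(data))
    return data[:m], data[m:]


def project(sample, v):
    """Replace each covariate vector x by the scalar v^T x, as in (2.1)."""
    return [(sum(vj * xj for vj, xj in zip(v, x)), y) for x, y in sample]


# Toy data: 8 points in dimension 2 (arbitrary values, for illustration only).
data = [((0.1 * i, -0.05 * i), float(i)) for i in range(8)]
training, learning = split_sample(data)        # m = 3n/4
v = (1 / math.sqrt(2), 1 / math.sqrt(2))       # one candidate direction
projected = project(training, v)               # univariate sample D_m(v)
```

Each candidate direction `v` yields its own univariate sample, on which a univariate estimator can then be fitted.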

Using the training sample, we compute a family
$\{\bar g^{(\lambda)}; \lambda \in \Lambda\}$ of linear (or weak) estimators of
the regression $g$. Each of these estimators depends on a parameter
$\lambda = (v, s)$ which makes it work based on the data "as if" the true
underlying index were $v$ and "as if" the smoothness of the link function were
$s$ (in the Hölder sense, see Section 3).

Then, using the learning sample, we compute a weight $w(\bar g) \in [0, 1]$ for
each $\bar g \in \{\bar g^{(\lambda)}; \lambda \in \Lambda\}$, satisfying
$\sum_{\lambda \in \Lambda} w(\bar g^{(\lambda)}) = 1$. These weights give a
level of significance to each weak estimator. Finally, the adaptive, or
aggregated estimator, is simply the convex combination of the weak estimators:

$$\hat g := \sum_{\lambda \in \Lambda} w(\bar g^{(\lambda)}) \bar g^{(\lambda)}.$$

The family of weak estimators consists of univariate local polynomial
estimators (LPE), with a data-driven bandwidth that fits locally to the amount
of data. In the next section the parameter $\lambda = (v, s)$ is fixed and
known: we construct a univariate LPE based on the sample
$D_m(v) = [(Z_i, Y_i); 1 \le i \le m] = [(v^\top X_i, Y_i); 1 \le i \le m]$.



2.1. Weak estimators: univariate LPE 

The LPE is standard in the statistical literature, see for instance Fan and
Gijbels (1995, 1996), among many others. We construct an estimator $\bar f$ of
$f$ based on i.i.d. copies $[(Z_i, Y_i); 1 \le i \le m]$ of a couple
$(Z, Y) \in \mathbb{R} \times \mathbb{R}$ such that

$$Y = f(Z) + \sigma(Z)\varepsilon, \tag{2.2}$$

where $\varepsilon$ is standard Gaussian noise independent of $Z$,
$\sigma : \mathbb{R} \to [\sigma_0, \sigma_1] \subset (0, +\infty)$ and
$f \in H(s, L)$, where $H(s, L)$ is the set of $s$-Hölderian functions such that

$$|f^{(\lfloor s \rfloor)}(z_1) - f^{(\lfloor s \rfloor)}(z_2)| \le L |z_1 - z_2|^{s - \lfloor s \rfloor}$$

for any $z_1, z_2 \in \mathbb{R}$, where $L > 0$ and $\lfloor s \rfloor$ stands
for the largest integer smaller than $s$. This Hölder assumption is standard in
the nonparametric literature.

Let $r \in \mathbb{N}$ and $h > 0$ be fixed. If $z$ is fixed, we consider the
polynomial $\bar P_{(z,h)} \in \mathrm{Pol}_r$ (the set of real polynomials with
degree at most $r$) which minimizes in $P$:

$$\sum_{i=1}^m (Y_i - P(Z_i - z))^2 \mathbf{1}_{Z_i \in I(z,h)}, \tag{2.3}$$

where $I(z, h) := [z - h, z + h]$, and we define the LPE at $z$ by

$$\bar f(z, h) := \bar P_{(z,h)}(0).$$

The polynomial $\bar P_{(z,h)}$ is well-defined and unique when the symmetric
matrix $\bar Z_m(z, h)$, with entries

$$(\bar Z_m(z, h))_{a,b} := \frac{1}{m \bar P_Z[I(z,h)]} \sum_{i=1}^m \Big(\frac{Z_i - z}{h}\Big)^{a+b} \mathbf{1}_{Z_i \in I(z,h)} \tag{2.4}$$

for $(a, b) \in \{0, \ldots, r\}^2$, is positive definite, where $\bar P_Z$ is
the empirical distribution of $(Z_i)_{1 \le i \le m}$, given by

$$\bar P_Z[A] := \frac{1}{m} \sum_{i=1}^m \mathbf{1}_{Z_i \in A} \tag{2.5}$$

for any $A \subset \mathbb{R}$. When $\bar Z_m(z, h)$ is degenerate, we simply
take $\bar f(z, h) := 0$. The tuning parameter $h > 0$, which is called the
bandwidth, localizes the least squares problem (2.3) around the point $z$. Of
course, the choice of $h$ is of first importance in this estimation method (as
with any linear method). An important remark concerns the design law. Indeed,
the law of $Z = v^\top X$ varies strongly with $v$: even if $P_X$ is very simple
(for instance uniform over some subset of $\mathbb{R}^d$ with positive Lebesgue
measure), $P_{v^\top X}$ can be "far" from the uniform law, namely with a
density that can vanish at the boundaries of its support, or inside the support,
see the examples in Figure 1. This remark motivates the following choice for the
bandwidth.



Fig 1. Simple design examples (panels: uniform design laws $P_X$ and the
corresponding densities of $P_{v^\top X}$)

If $f \in H(s, L)$ for known $s$ and $L$, a "natural" bandwidth, which balances
the bias and the variance of the LPE, is given by

$$H_m(z) := \operatorname{argmin}_{h > 0} \Big\{ L h^s \ge \frac{\sigma_1}{(m \bar P_Z[I(z,h)])^{1/2}} \Big\}. \tag{2.6}$$

This bandwidth choice stabilizes the LPE, since it fits point-by-point to the
local amount of data. We consider then

$$\bar f(z) := \bar f(z, H_m(z)), \tag{2.7}$$

for any $z \in \mathbb{R}$, which is in view of Theorem 3 (see Section 3) a
minimax estimator over $H(s, L)$ in model (2.2).
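As an illustration of a bandwidth that fits locally to the amount of data, here is a minimal Python sketch. It rests on two loudly stated assumptions: the balance rule (2.6) is read as "the smallest $h$ such that $L h^s \ge \sigma / \sqrt{N(h)}$", where $N(h)$ is the number of design points in $[z - h, z + h]$, and the local fit is restricted to degree $r = 1$ so that it has a closed form. The names `balance_bandwidth` and `local_linear_fit` are hypothetical; the study itself uses higher degrees and a C implementation.

```python
import math


def balance_bandwidth(Z, z, s, L, sigma):
    """Smallest h > 0 with L * h**s >= sigma / sqrt(N(h)),
    where N(h) is the number of Z_i in [z - h, z + h]."""
    dists = sorted(abs(zi - z) for zi in Z)
    for k, d_k in enumerate(dists, start=1):
        d_next = dists[k] if k < len(dists) else math.inf
        # With exactly k points in the window, the balance needs h >= h_k.
        h_k = (sigma / (L * math.sqrt(k))) ** (1.0 / s)
        h = max(d_k, h_k)
        if h < d_next:  # the window [z - h, z + h] still holds exactly k points
            return h
    return None  # only reached for an empty sample


def local_linear_fit(Z, Y, z, h):
    """Degree-1 local least squares fit on [z - h, z + h], evaluated at z.
    Returns 0.0 when the local design is degenerate, as in the text."""
    pts = [(zi - z, yi) for zi, yi in zip(Z, Y) if abs(zi - z) <= h]
    k = len(pts)
    if k < 2:
        return 0.0
    mu = sum(u for u, _ in pts) / k
    my = sum(y for _, y in pts) / k
    sxx = sum((u - mu) ** 2 for u, _ in pts)
    if sxx == 0.0:
        return 0.0
    slope = sum((u - mu) * (y - my) for u, y in pts) / sxx
    return my - slope * mu  # value of the fitted line at u = 0, i.e. at z


Z = [-0.5, -0.1, 0.0, 0.1, 0.5]
Y = [2.0 * zi + 1.0 for zi in Z]  # noiseless linear link, for checking only
h = balance_bandwidth(Z, z=0.0, s=1.0, L=1.0, sigma=0.5)
estimate = local_linear_fit(Z, Y, z=0.0, h=h)
```

Where the design is sparse around $z$, the selected window widens automatically; where it is dense, the window shrinks.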

Remark 1. The reason why we consider local polynomials instead of some other
method (like smoothing splines, for instance) is theoretical. It is linked with
the fact that we need minimax weak estimators under the general design
Assumption (D), so that the aggregated estimator is also minimax.



2.2. Adaptation by aggregation 

If $\lambda := (v, s)$ is fixed, we consider the LPE $\bar f^{(\lambda)}$ given
by (2.7), and we take

$$\bar g^{(\lambda)}(x) := \tau_Q(\bar f^{(\lambda)}(v^\top x)), \tag{2.8}$$

for any $x \in \mathbb{R}^d$ as an estimator of $g$, where
$\tau_Q(f) := \max(-Q, \min(Q, f))$ is the truncation operator by $Q > 0$. The
reason why we need to truncate the weak estimators is related to the theoretical
results concerning the aggregation procedure described below, see Theorem 4 in
Section 3. In order to adapt to the index $\vartheta$ and to the smoothness $s$
of the link function, we aggregate the weak estimators from the family
$\{\bar g^{(\lambda)}; \lambda \in \Lambda\}$ with the following algorithm: we
take the convex combination

$$\hat g := \sum_{\lambda \in \Lambda} w(\bar g^{(\lambda)}) \bar g^{(\lambda)} \tag{2.9}$$

where for a function $\bar g \in \{\bar g^{(\lambda)}; \lambda \in \Lambda\}$,
the weight is given by

$$w(\bar g) := \frac{\exp(-T R_{(m)}(\bar g))}{\sum_{\lambda \in \Lambda} \exp(-T R_{(m)}(\bar g^{(\lambda)}))}, \tag{2.10}$$

with a temperature parameter $T > 0$ and

$$R_{(m)}(\bar g) := \sum_{i=m+1}^{n} (Y_i - \bar g(X_i))^2, \tag{2.11}$$

which is the empirical sum of squares of $\bar g$ over the learning sample (up
to a division by the sample size). This aggregation algorithm (with Gibbs
weights) can be found in Leung and Barron (2006) in the regression framework,
for projection-type weak estimators; see also Catoni (2001), Juditsky et al.
and Yang (2004).

We can understand the aggregation algorithm in the following way: first, we
compute the least squares criterion of each weak estimator. This is the most
natural way of assessing the level of significance of some estimator among the
other ones. Then, we put a Gibbs law over the set of weak estimators. The mass
of each estimator depends on its least squares criterion (over the learning
sample). Finally, the aggregate is simply the mean of the weak estimators under
this law.

If $T$ is small, the weights (2.10) are close to the uniform law over the set
of weak estimators, and of course, the resulting aggregate is inaccurate. If $T$
is large, one weight is close to 1 and the others are close to 0: in this
situation, the aggregate is close to the estimator obtained by empirical risk
minimization (ERM). This behavior can also be explained by equation (5.10) in
the proof of Theorem 4. Indeed, the exponential weights (2.10) realize an
optimal tradeoff between the ERM procedure and the uniform weights procedure.
Hence, $T$ is somehow a regularization parameter of this tradeoff.
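The role of the temperature can be seen on a toy example. The Python sketch below computes weights proportional to $\exp(-T R)$, where $R$ is a sum of squared residuals over a learning sample, and forms the corresponding convex combination. The function names, the three constant or linear "weak estimators" and the data are illustrative assumptions; subtracting the smallest risk before exponentiating is a standard numerical safeguard that does not change the weights.

```python
import math


def empirical_risk(g, learning):
    """Sum of squared residuals of g over the learning sample."""
    return sum((y - g(x)) ** 2 for x, y in learning)


def gibbs_weights(risks, T):
    """Weights proportional to exp(-T * risk); they sum to one."""
    r_min = min(risks)  # shift for numerical stability; cancels in the ratio
    unnorm = [math.exp(-T * (r - r_min)) for r in risks]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def aggregate(estimators, weights):
    """Convex combination of the estimators with the given weights."""
    return lambda x: sum(w * g(x) for g, w in zip(estimators, weights))


# Toy learning sample and three hypothetical weak estimators.
learning = [(0.0, 0.1), (0.5, 0.4), (1.0, 1.1)]
estimators = [lambda x: x, lambda x: 0.0, lambda x: 1.0]
risks = [empirical_risk(g, learning) for g in estimators]

for T in (0.0, 1.0, 100.0):
    weights = gibbs_weights(risks, T)
    print(T, [round(w, 3) for w in weights])
```

With $T = 0$ the weights are uniform, and as $T$ grows the mass concentrates on the estimator with the smallest risk, which is the ERM.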

The ERM already gives good results, but if $T$ is chosen carefully, we expect
to obtain an estimator which outperforms the ERM. It has been proved
theoretically in Lecué (2007) that an aggregation procedure outperforms the ERM
in the regression framework. This fact is confirmed by the numerical study
conducted in Section 4, where the choice of $T$ is done using a simple
leave-one-out cross-validation algorithm over the whole sample for aggregates
obtained with several $T$. Namely, we consider the temperature

$$\hat T := \operatorname{argmin}_{T \in \mathcal{T}} \sum_{i=1}^{n} \big(Y_i - \hat g^{(T)}_{-i}(X_i)\big)^2, \tag{2.12}$$

where $\hat g^{(T)}_{-i}$ is the aggregated estimator (2.9) with temperature
$T$, based on the sample $D_n^{-i} = [(X_j, Y_j); j \ne i]$, and where
$\mathcal{T}$ is some set of temperatures (in Section 4 we take
$\mathcal{T} = \{0.1, 0.2, \ldots, 4.9, 5\}$).

The set of parameters $\Lambda$ is given by $\Lambda := S \times G$, where $G$
is the grid with step $(\log n)^{-1}$ given by

$$G := \{ s_{\min},\ s_{\min} + (\log n)^{-1},\ s_{\min} + 2(\log n)^{-1},\ \ldots,\ s_{\max} \}. \tag{2.13}$$

The tuning parameters $s_{\min}$ and $s_{\max}$ correspond to the minimum and
maximum "allowed" smoothness for the link function: for this grid choice, the
aggregated estimator converges with the optimal rate for a link function in
$H(s, L)$ for any $s \in [s_{\min}, s_{\max}]$ in view of Theorem 1.
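A grid of this kind is easy to generate. The following Python sketch is one possible reading of (2.13), assuming that $s_{\max}$ is always appended as the last point even when it is not an exact multiple of the step away from $s_{\min}$; the function name `smoothness_grid` is hypothetical.

```python
import math


def smoothness_grid(n, s_min, s_max):
    """Points s_min + k / log(n) below s_max, followed by s_max itself."""
    step = 1.0 / math.log(n)
    grid = []
    k = 0
    while s_min + k * step < s_max:
        grid.append(s_min + k * step)
        k += 1
    grid.append(s_max)  # assumption: the upper end is always included
    return grid


grid = smoothness_grid(n=100, s_min=1.0, s_max=2.0)
print([round(s, 3) for s in grid])
```

The grid gets finer as $n$ grows, at the slow rate $(\log n)^{-1}$, so its cardinality stays small.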

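The set $S = S_\Delta^{d-1}$ is the regular lattice of the half unit sphere
$S_+^{d-1}$ with discretization step $\Delta$, which is constructed as follows.
Let us introduce $\Phi(\delta) := \bigcup_{k \ge 0} \{k\delta\} \cap [0, \pi]$
and consider the function $p : [0, \pi]^{d-1} \to S_+^{d-1}$ defined by
$p(\phi_1, \ldots, \phi_{d-1}) = (x_1, \ldots, x_d)$, where

$$
\begin{aligned}
x_1 &= \cos(\phi_1) \cos(\phi_2) \times \cdots \times \cos(\phi_{d-1}) \\
x_2 &= \sin(\phi_1) \cos(\phi_2) \times \cdots \times \cos(\phi_{d-1}) \\
&\ \ \vdots \\
x_i &= \sin(\phi_{i-1}) \cos(\phi_i) \times \cdots \times \cos(\phi_{d-1}) \\
&\ \ \vdots \\
x_{d-1} &= \sin(\phi_{d-2}) \cos(\phi_{d-1}) \\
x_d &= \sin(\phi_{d-1}).
\end{aligned}
$$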

Then, the regular lattice $S_\Delta^{d-1}$ is constructed using Algorithm 1. In
Figure 2 we show $S_\Delta^{d-1}$ for $\Delta = 0.1$ and $d = 2, 3$. The step is
taken as

$$\Delta = (n \log n)^{-1/(2 s_{\min})}, \tag{2.14}$$

which relies on the minimal allowed smoothness of the link function. For
instance, if we want the estimator to be adaptive for link functions that are at
least Lipschitz, we take $\Delta = (n \log n)^{-1/2}$.

Fig 2. Lattices $S_\Delta^{d-1}$ for $\Delta = 0.1$ and $d = 2, 3$
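The spherical-coordinate map used for the lattice can be checked numerically. The Python sketch below implements a map from $d - 1$ angles to a unit vector in $\mathbb{R}^d$ with $x_1 = \cos\phi_1 \cdots \cos\phi_{d-1}$, $x_i = \sin\phi_{i-1}\cos\phi_i \cdots \cos\phi_{d-1}$ and $x_d = \sin\phi_{d-1}$, together with the angle grid $\Phi(\delta)$. It only builds the lattice for $d = 2$, where the construction reduces to a single loop over angles spaced by $\arccos(1 - \Delta^2/2)$ (the angle whose chord has length $\Delta$); the nested loops for $d \ge 3$ are not reproduced here. All function names are hypothetical.

```python
import math


def sphere_point(phis):
    """Map angles (phi_1, ..., phi_{d-1}) to a unit vector (x_1, ..., x_d)."""
    d = len(phis) + 1
    point = []
    for i in range(1, d + 1):
        # x_1 has no sine factor; x_i (i >= 2) starts with sin(phi_{i-1}).
        coord = 1.0 if i == 1 else math.sin(phis[i - 2])
        # Remaining factors: cos(phi_j) for j = max(i, 1), ..., d - 1.
        for j in range(max(i, 1), d):
            coord *= math.cos(phis[j - 1])
        point.append(coord)
    return point


def angle_grid(delta):
    """Phi(delta): the multiples of delta lying in [0, pi]."""
    return [k * delta for k in range(int(math.pi / delta) + 1)]


def chord_angle(step):
    """Angle between two unit vectors at Euclidean distance `step`."""
    return math.acos(1.0 - step ** 2 / 2.0)


step = 0.1
lattice_2d = [sphere_point([phi]) for phi in angle_grid(chord_angle(step))]
print(len(lattice_2d), lattice_2d[:2])
```

By construction, consecutive points of `lattice_2d` are at Euclidean distance `step` from each other and all have a nonnegative last coordinate.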

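Input: $d$ (dimension parameter) and $\Delta$ (discretization step)
Output: $S_\Delta^{d-1}$ (regular discretization of $S_+^{d-1}$)

$S_\Delta^{d-1} = \emptyset$
$\Phi_{d-1} = \Phi(\arccos(1 - \Delta^2/2))$
foreach $\phi_{d-1} \in \Phi_{d-1}$ do
    $\Phi_{d-2} = \Phi(\Delta / \arccos(\phi_{d-1}))$
    foreach $\phi_{d-2} \in \Phi_{d-2}$ do
        $\Phi_{d-3} = \Phi(\Delta / \arccos(\phi_{d-2}))$
        ...
        foreach $\phi_2 \in \Phi_2$ do
            $\Phi_1 = \Phi(\Delta / \arccos(\phi_2))$
            foreach $\phi_1 \in \Phi_1$ do
                add the point of coordinates $p(\phi_1, \ldots, \phi_{d-1})$ to $S_\Delta^{d-1}$
            end
        end
        ...
    end
end

Algorithm 1: Construction of the regular lattice $S_\Delta^{d-1}$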

2.3. Reduction of the complexity of the algorithm 

The adaptive procedure described previously requires the computation of the
LPE for each parameter $\lambda \in \Lambda \times \mathcal{L}$ (actually, we
also use a grid $\mathcal{L}$ over the radius parameter $L$ in the simulations).
Hence, there are $|S_\Delta^{d-1}| \times |G| \times |\mathcal{L}|$ LPE to
compute. Namely, this is $(\pi/\Delta)^{d-1} \times |G| \times |\mathcal{L}|$,
which equals, if $|G| = |\mathcal{L}| = 4$ and $\Delta = (n \log n)^{-1/2}$ (see
Section 4), 1079 when $d = 2$ and 72722 when $d = 3$, which is much too large.
Hence, the complexity of this procedure must be reduced: we propose a recursive
algorithm which strongly reduces the complexity of the estimator. Actually, the
coefficients $w(\bar g^{(\lambda)})$ are very close to zero (see Figures 7 and 8
in Section 4) when $\lambda = (v, s)$ is such that $v$ is "far" from the true
index $\vartheta$. Hence, these coefficients should not be computed at all,
since the corresponding weak estimators do not contribute to the aggregated
estimator (2.9). Thus, instead of using a lattice of the whole half unit sphere
for detecting the index, we only build a part of it, which corresponds to the
most significant coefficients. This is done with an iterative algorithm, see
Algorithm 2, which makes a preselection of weak estimators to aggregate
($B^d(v, \delta)$ stands for the ball in $(\mathbb{R}^d, \|\cdot\|_2)$ centered
at $v$ with radius $\delta$ and $R_{(m)}(\bar g)$ is given by (2.11)).
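The two counts quoted above can be reproduced from the formula $(\pi/\Delta)^{d-1} \times |G| \times |\mathcal{L}|$ with $\Delta = (n \log n)^{-1/2}$. The sample size is not stated next to these numbers; the Python sketch below assumes $n = 100$, the smallest sample size used in Section 4, which happens to reproduce both values. The function name is hypothetical.

```python
import math


def n_weak_estimators(n, d, n_smoothness, n_radius):
    """(pi / step)^(d - 1) * |G| * |L| with step = (n log n)^(-1/2)."""
    step = (n * math.log(n)) ** -0.5
    return (math.pi / step) ** (d - 1) * n_smoothness * n_radius


for d in (2, 3):
    count = n_weak_estimators(n=100, d=d, n_smoothness=4, n_radius=4)
    print(d, round(count))
```

The count is multiplied by $\pi/\Delta$ (about 67 here) each time the dimension increases by one, which is what motivates the preselection step.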

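Input: $(X_i, Y_i)$ (data), $G$ (smoothness grid)
Output: $S$ (a section of $S_\Delta^{d-1}$)

Put $\Delta = (n \log n)^{-1/2}$ and $\Delta_0 = (2dn)^{-1/(2(d-1))}$
Compute the lattice $S = S_{\Delta_0}^{d-1}$ and put $\Lambda := S \times G$
while $\Delta_0 > \Delta$ do
    find the point $\hat v$ such that $(\hat v, \hat s) = \hat\lambda = \operatorname{argmin}_{\lambda \in \Lambda} R_{(m)}(\bar g^{(\lambda)})$
    put $\Delta_0 = \Delta_0 / 2$
    put $S = S_{\Delta_0}^{d-1} \cap B^d(\hat v, 2\Delta_0)$ and $\Lambda := S \times G$
end

Algorithm 2: Preselection of the coefficients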

When the algorithm exits, $S$ is a section of the lattice $S_\Delta^{d-1}$
centered at $\hat v$ with radius $2^{d-1}\Delta$, which contains (with a high
probability) the points $v \in S_\Delta^{d-1}$ corresponding to the largest
coefficients $w(\bar g^{(\lambda)})$ where
$\lambda = (v, s, L) \in S_\Delta^{d-1} \times G \times \mathcal{L}$. The
aggregate is then computed for a set of parameters
$\Lambda = S \times G \times \mathcal{L}$ using (2.9) with weights (2.10). The
parameter $\Delta_0$ is chosen so that the surface of $B^d(v, \Delta_0)$ is
$C_d (2dn)^{-1/2}$: n is not a power of d. Moreover, the number of iterations is
$O(\log n)$, thus the complexity is much smaller than that of the full
aggregation algorithm. This procedure gives nice empirical results, see
Section 4. We show the iterative construction of $S$ in Figure 3.



3. Main results 

The error of estimation is measured with the $L^2(P_X)$-norm, defined by

$$\|g\|_{L^2(P_X)} := \Big( \int g(x)^2 \, P_X(dx) \Big)^{1/2},$$

where we recall that $P_X$ is the design law. We consider the set
$H^Q(s, L) := H(s, L) \cap \{ f : \mathbb{R} \to \mathbb{R} \mid \|f\|_\infty := \sup_x |f(x)| \le Q \}$.
Since we want the adaptive procedure to work whatever
$\vartheta \in S_+^{d-1}$ is, we need to work with assumptions on the law of
$\vartheta^\top X$ that are as general as possible. The following assumption
generalizes the usual assumptions on random designs (when $P_X$ has a density
with respect to the Lebesgue measure) that can be met in the literature. Namely,
we do not assume that the density of $P_{v^\top X}$ is bounded away from zero.
Indeed, even with a very simple $P_X$, this assumption holds for specific $v$
only (see Figure 1). We say that a real random variable $Z$ satisfies
Assumption (D) if:

Fig 3. Iterative construction of the set $S$ of preselected weak estimator
indexes. Weak estimators are aggregated only for $v \in S$ (bottom right), which
is concentrated around the true index.

Assumption (D). There is a density $\mu$ of $P_Z$ with respect to the Lebesgue
measure which is continuous. Moreover, we assume that

• $\mu$ is compactly supported;

• There is a finite number of $z$ in the support of $\mu$ such that
$\mu(z) = 0$;

• For any such $z$, there is an interval $I_z = [z - a_z, z + b_z]$ such that
$\mu$ is decreasing over $[z - a_z, z]$ and increasing over $[z, z + b_z]$;

• There are $\beta \ge 0$ and $\gamma > 0$ such that

$$P_Z[I] \ge \gamma |I|^{\beta + 1}$$

for any $I$, where $|I|$ stands for the length of $I$.

This assumption includes any design with a continuous density with respect
to the Lebesgue measure that can vanish at several points, but not faster than
some power function.


3.1. Upper and lower bounds 

The next theorem provides an upper bound for the adaptive estimator constructed
in Section 2. This upper bound holds for quite general tuning parameters. The
temperature $T > 0$ can be arbitrary (but not in practice of course). The
training sample size is given by

$$m = [n(1 - \ell_n)], \tag{3.1}$$

where $[x]$ is the integer part of $x$, and where $\ell_n$ is a positive
sequence such that for all $n$, $(\log n)^{-\alpha} \le \ell_n < 1$ with
$\alpha > 0$. Note that in methods involving data splitting, the optimal choice
of the split size is open. The degree $r$ of the LPE and the grid choice $G$
must be such that $s_{\max} \le r + 1$.

The upper bound below shows that the estimator converges with the optimal
rate for a link function in a whole family of Hölder classes, and for any index.
In what follows, $E^n$ stands for the expectation with respect to the joint law
$P^n$ of the whole sample $D_n$.

Theorem 1. Let $\hat g$ be the aggregated estimator given by (2.9) with the
weights (2.10). If for all $v \in S_+^{d-1}$, $v^\top X$ satisfies
Assumption (D), we have

$$\sup_{\vartheta \in S_+^{d-1}} \ \sup_{f \in H^Q(s, L)} E^n \|\hat g - g\|^2_{L^2(P_X)} \le C n^{-2s/(2s+1)},$$

for any $s \in [s_{\min}, s_{\max}]$ when $n$ is large enough, where we recall
that $g(\cdot) = f(\vartheta^\top \cdot)$. The constant $C > 0$ depends on
$\sigma_1$, $L$, $s_{\min}$, $s_{\max}$ and $P_X$ only.

Note that the construction of $\hat g$ depends neither on the index
$\vartheta$, nor on the smoothness $s$ of the link function $f$, nor on the
design law $P_X$. The assumption that $v^\top X$ satisfies Assumption (D) for
any $v \in S_+^{d-1}$ holds, for instance, for the multivariate designs from
Figure 1. More generally, this property holds for any uniform law over a support
that does not have a very "spiky" boundary. Note that this assumption is more
general than the one considered in Audibert and Tsybakov (2007).

In Theorem 2 below, we prove in our setting (when Assumption (D) holds
on the design) that $n^{-2s/(2s+1)}$ is a lower bound for a link function in
$H(s, L)$ in the single-index model.
Theorem 2. Let $s, L, Q > 0$ and $\vartheta \in S_+^{d-1}$ be such that
$\vartheta^\top X$ satisfies Assumption (D). We have

$$\inf_{\tilde g} \ \sup_{f \in H^Q(s, L)} E^n \|\tilde g - g\|^2_{L^2(P_X)} \ge C' n^{-2s/(2s+1)},$$

where the infimum is taken among all estimators based on data from (1.1), (1.2),
and where $C' > 0$ is a constant depending on $\sigma_1$, $s$, $L$ and
$P_{\vartheta^\top X}$ only.

Theorem 1 and Theorem 2 together entail that $n^{-2s/(2s+1)}$ is the minimax
rate for the estimation of $g$ in model (1.1) under the constraint (1.2) when
the link function belongs to an $s$-Hölder class. It answers in particular
Question 2 from Stone (1982).



3.2. A new result for the LPE 

In this section, we give upper bounds for the LPE in the univariate regression
model (2.2). Despite the fact that the literature about LPE is wide, the theorem
below is new. It provides a minimax optimal upper bound for the
$L^2(P_Z)$-integrated risk of the LPE over Hölder balls under Assumption (D),
which is a general assumption for random designs having a density with respect
to the Lebesgue measure.

In this section, the smoothness $s$ is supposed known and fixed, and we assume
that the degree $r$ of the local polynomials satisfies $r + 1 \ge s$. First, we
give an upper bound for the pointwise risk conditionally on the design. Then, we
derive from it an upper bound for the $L^2(P_Z)$-integrated risk, using standard
tools from empirical process theory (see Appendix). Here, $E^m$ stands for the
expectation with respect to the joint law $P^m$ of the observations
$[(Z_i, Y_i); 1 \le i \le m]$ from model (2.2). Let us define the matrix

$$\bar Z_m(z) := \bar Z_m(z, H_m(z)),$$

where $\bar Z_m(z, h)$ is given by (2.4) and $H_m(z)$ is given by (2.6). Let us
denote by $\lambda(M)$ the smallest eigenvalue of a matrix $M$ and introduce
$Z_1^m := (Z_1, \ldots, Z_m)$.

Theorem 3. For any $z \in \operatorname{Supp} P_Z$, let $\bar f(z)$ be given by
(2.7). We have on the event $\{\lambda(\bar Z_m(z)) > 0\}$:

$$\sup_{f \in H(s, L)} E^m\big[ (\bar f(z) - f(z))^2 \mid Z_1^m \big] \le 2 \lambda(\bar Z_m(z))^{-2} L^2 H_m(z)^{2s}. \tag{3.2}$$

Moreover, if $Z$ satisfies Assumption (D), we have

$$\sup_{f \in H^Q(s, L)} E^m\big[ \|\tau_Q(\bar f) - f\|^2_{L^2(P_Z)} \big] \le C_2 m^{-2s/(2s+1)} \tag{3.3}$$

for $m$ large enough, where we recall that $\tau_Q$ is the truncation operator
by $Q > 0$ and where $C_2 > 0$ is a constant depending on $s$, $Q$, and $P_Z$
only.

Remark 2. While inequality (3.2) in Theorem 3 is stated over
$\{\lambda(\bar Z_m(z)) > 0\}$, which entails the existence and the uniqueness
of a solution to the linear system (2.3) (this inequality is stated
conditionally on the design), we only need Assumption (D) for inequality (3.3)
to hold.
3.3. Oracle inequality

In this section, we provide an oracle inequality for the aggregation
algorithm (2.9) with weights (2.10). This result, which is of independent
interest, is stated for a general finite set
$\{g^{(\lambda)}; \lambda \in \Lambda\}$ of deterministic functions such that
$\|g^{(\lambda)}\|_\infty \le Q$ for all $\lambda \in \Lambda$. These functions
are for instance weak estimators computed with the training sample (or frozen
sample), which is independent of the learning sample. Let
$D := [(X_i, Y_i); 1 \le i \le |D|]$ (where $|D|$ stands for the cardinality of
$D$) be an i.i.d. sample of $(X, Y)$ from the multivariate regression
model (1.1), where no particular structure like (1.2) is assumed.

The aim of aggregation schemes is to mimic (up to an additive residual)
the oracle in $\{g^{(\lambda)}; \lambda \in \Lambda\}$. This aggregation
framework has been considered, among others, by Birgé (2005), Catoni (2001),
Juditsky and Nemirovski (2000), Leung and Barron (2006), Nemirovski (2000),
Tsybakov (2003b) and Yang (2000b).

Theorem 4. The aggregation procedure $\hat g$ based on the learning sample $D$
defined by (2.9) and (2.10) satisfies

$$E^D \|\hat g - g\|^2_{L^2(P_X)} \le (1 + a) \min_{\lambda \in \Lambda} \|g^{(\lambda)} - g\|^2_{L^2(P_X)} + \frac{C \log|\Lambda| \, (\log|D|)^{1/2}}{|D|}$$

for any $a > 0$, where $|\Lambda|$ denotes the cardinality of $\Lambda$, where
$E^D$ stands for the expectation with respect to the joint law of $D$, and where
$C := 3[8Q^2(1 + a)^2/a + 4(6Q^2 + 2\sigma_1 2\sqrt{2})(1 + a)/3] + 2 + 1/T$.

This theorem is a model-selection type oracle inequality for the aggregation
procedure given by (2.9) and (2.10). Sharper oracle inequalities for more
general models can be found in Juditsky et al. (2005a), where the algorithm used
therein requires an extra cumulative sum.

Remark 3. Inspection of the proof of Theorem 4 shows that the ERM (which
is the estimator minimizing the empirical risk
$R_{(m)}(g) := \sum_{i=m+1}^{n} (Y_i - g(X_i))^2$ over all $g$ in
$\{g^{(\lambda)}; \lambda \in \Lambda\}$) satisfies the same oracle inequality.
Nevertheless, it has been proved in Lecué (2007) that the ERM is theoretically
suboptimal in this framework, when we want to mimic the oracle without the extra
factor $1 + a$ in front of the bias term
$\min_{\lambda \in \Lambda} \|g^{(\lambda)} - g\|^2_{L^2(P_X)}$. The simulation
study of Section 4 (see the figures there) confirms this suboptimality.



4. Numerical illustrations 

We implemented the procedure described in Section 2 using the R software
(see http://www.r-project.org/). In order to increase computation speed,
we implemented the computation of local polynomials and the bandwidth
selection (2.6) in C language. The simulated samples satisfy (1.1), (1.2), where
the noise is centered Gaussian with homoscedastic variance

$$\sigma = \Big[ \sum_{1 \le i \le n} f(\vartheta^\top X_i)^2 / (n \times \mathrm{rsnr}) \Big]^{1/2},$$

where rsnr = 5. This choice of $\sigma$ makes the root-signal-to-noise ratio,
which is a commonly used assessment of the complexity of estimation, equal to 5.
We consider the following link functions (see the dashed lines in Figures 9
and 10):

• oscsine($x$) = $4(x + 1) \sin(4\pi x^2)$,

• hardsine($x$) = $2 \sin(1 + x) \sin(2\pi x^2 + 1)$.

The simulations are done with a uniform design on $[-1, 1]^d$, with dimensions
$d \in \{2, 3, 4\}$, and we consider several indexes $\vartheta$ that make
$P_{\vartheta^\top X}$ not uniform.
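The simulation design can be sketched as follows in Python (the study itself uses R and C, so this is only an illustrative re-implementation under stated assumptions). The leading constant of oscsine is read as 4, the design is taken uniform on the cube $[-1, 1]^d$, and the names `simulate`, `seed` and the choice of random generator are hypothetical.

```python
import math
import random


def oscsine(x):
    return 4.0 * (x + 1.0) * math.sin(4.0 * math.pi * x * x)


def hardsine(x):
    return 2.0 * math.sin(1.0 + x) * math.sin(2.0 * math.pi * x * x + 1.0)


def simulate(n, index, link, rsnr=5.0, seed=0):
    """Draw (X_i, Y_i) with Y = link(index^T X) + sigma * eps, where
    sigma^2 = sum_i link(index^T X_i)^2 / (n * rsnr)."""
    rng = random.Random(seed)
    d = len(index)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    signal = [link(sum(v * x for v, x in zip(index, row))) for row in X]
    sigma = math.sqrt(sum(f * f for f in signal) / (n * rsnr))
    Y = [f + sigma * rng.gauss(0.0, 1.0) for f in signal]
    return X, Y, sigma


index = (1 / math.sqrt(2), 1 / math.sqrt(2))
X, Y, sigma = simulate(n=100, index=index, link=hardsine)
print(round(sigma, 3))
```

Because the noise level is computed from the realized signal, the ratio of the mean squared signal to $\sigma^2$ equals `rsnr` exactly in every simulated sample.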

In all the computations below, the parameters for the procedure are
$\Lambda = S \times G \times \mathcal{L}$, where $S$ is computed using the
algorithm described in Section 2.3 and where $G = \{1, 2, 3, 4\}$ and
$\mathcal{L} = \{0.1, 0.5, 1, 1.5\}$. The degree of the local polynomials is
$r = 5$. The learning sample has size $[n/4]$, and is chosen randomly in the
whole sample. We do not use a jackknife procedure (that is, the average of
estimators obtained with several learning subsamples), since the results are
stable enough (at least when $n \ge 100$) when we consider only one learning
sample.

In Tables 1 to 3 and Figures 4 and 5 we show the mean MISE for 100
replications and its standard deviation for several Gibbs temperatures, several
sample sizes and indexes. These results show empirically that the aggregated
estimator outperforms the ERM (which is computed as the aggregated estimator
with a large temperature $T = 30$) since in each case, the aggregated estimator
with cross-validated temperature (aggCVT, given by (2.12), with
$\mathcal{T} = \{0.1, 0.2, \ldots, 4.9, 5\}$), has a MISE much smaller than the
MISE of the ERM. Moreover, aggCVT is more stable than the ERM in view of the
standard deviations (in brackets). Note also that, as expected, the dimension
parameter has little impact on the accuracy of estimation: the MISEs are nearly
the same when $d = 2, 3, 4$.

Table 1
MISE against the Gibbs temperature ($f$ = hardsine, $d = 2$, $\vartheta = (1/\sqrt{2}, 1/\sqrt{2})$)

Temperature   0.1     0.5     0.7     1.0     1.5     2.0     ERM     aggCVT
n = 100       0.026   0.017   0.015   0.014   0.014   0.015   0.034   0.015
              (.009)  (.006)  (.006)  (.005)  (.005)  (.006)  (.018)  (.005)
n = 200       0.015   0.009   0.008   0.008   0.009   0.011   0.027   0.009
              (.004)  (.002)  (.003)  (.003)  (.005)  (.007)  (.014)  (.004)
n = 400       0.006   0.005   0.004   0.005   0.006   0.007   0.016   0.005
              (.001)  (.001)  (.001)  (.001)  (.002)  (.002)  (.003)  (.002)

Table 2
MISE against the Gibbs temperature ($f$ = hardsine, $d = 3$, $\vartheta = (2/\sqrt{14}, 1/\sqrt{14}, 3/\sqrt{14})$)

Temperature   0.1     0.5     0.7     1.0     1.5     2.0     ERM     aggCVT
n = 100       0.029   0.021   0.019   0.018   0.017   0.018   0.037   0.020
              (.011)  (.008)  (.008)  (.007)  (.008)  (.009)  (.022)  (.008)
n = 200       0.016   0.010   0.010   0.009   0.009   0.010   0.026   0.010
              (.005)  (.003)  (.003)  (.002)  (.002)  (.003)  (.008)  (.003)
n = 400       0.007   0.006   0.005   0.005   0.006   0.007   0.017   0.006
              (.002)  (.001)  (.001)  (.001)  (.001)  (.002)  (.003)  (.001)

Table 3
MISE against the Gibbs temperature ($f$ = hardsine, $d = 4$, $\vartheta = (1/\sqrt{21}, -2/\sqrt{21}, 0, 4/\sqrt{21})$)

Temperature   0.1     0.5     0.7     1.0     1.5     2.0     ERM     aggCVT
n = 100       0.038   0.027   0.021   0.019   0.017   0.017   0.038   0.020
              (.016)  (.010)  (.009)  (.008)  (.007)  (.007)  (.025)  (.010)
n = 200       0.019   0.013   0.012   0.012   0.013   0.014   0.031   0.013
              (.014)  (.009)  (.010)  (.011)  (.012)  (.012)  (.016)  (.010)
n = 400       0.009   0.006   0.005   0.005   0.006   0.007   0.017   0.006
              (.002)  (.001)  (.001)  (.001)  (.001)  (.002)  (.004)  (.001)

Fig 4. MISE against the Gibbs temperature $T$ for $f$ = hardsine,
$\vartheta = (1/\sqrt{2}, 1/\sqrt{2})$, $n = 200, 400$ (solid line = mean of the
MISE for 100 replications, dashed line = mean MISE ± standard deviation).

Fig 5. MISE against the Gibbs temperature $T$ for $f$ = hardsine,
$\vartheta = (2/\sqrt{14}, 1/\sqrt{14}, 3/\sqrt{14})$, $n = 200, 400$ (solid
line = mean of the MISE for 100 replications, dashed line = mean MISE ± standard
deviation).

The aim of Figures [7] and [5] is to give an illustration of the aggregation 
phenomenon. In these figures, we show the points 



{(1 + w{g^))$ for A = (0, s, L) e A = S^ 1 x {3} x {!}} (4.1) 
Fig 7. Weights associated to each point (see (4.1)) of the lattice for $\Delta = 0.03$,
$\vartheta = (1/\sqrt{2}, 1/\sqrt{2})$ and T = 0.05, 0.2, 0.5, 10 (from top to bottom and left to right).

Fig 8. Weights associated to each point (see (4.1)) of the lattice for $\Delta = 0.07$,
$\vartheta = (0, 0, 1)$ and T = 0.05, 0.3, 0.5, 10 (from top to bottom and left to right).

obtained for a single run (that is, we take s = 3 and L = 1 in the bandwidth
choice (2.6) and we do not use the reduction of complexity algorithm). These
figures motivate the use of the complexity reduction algorithm, since only the
weights corresponding to a point of $S^{d-1}_\Delta$ which is close to the true index are
significant (at least numerically). Moreover, these weights provide information
about the true index: the direction $v \in S^{d-1}_\Delta$ corresponding to the largest
coefficient $w(\bar g^{(\lambda)})$ for $\lambda = (v, s, L)$ is an accurate estimator of the index, see
Figures 7 and 8. Finally, we show typical realisations for several index functions,
indexes and sample sizes in Figures 9, 10, 11 and 12.
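
The idea of reading the index off the largest weight can be sketched as follows. This is a toy illustration, not the procedure of the paper: the weak estimator is replaced by a 1-nearest-neighbour rule on the projected training sample (an assumption made to keep the code short), the data are noiseless, the lattice is a coarse grid of 8 directions on the half-circle, and all names are hypothetical.

```python
import math
import random

def index_from_weights(X, Y, directions, temperature):
    """Return (weights, direction carrying the largest Gibbs weight)."""
    half = len(X) // 2
    train, learn = range(half), range(half, len(X))
    risks = []
    for v in directions:
        proj = [v[0] * x[0] + v[1] * x[1] for x in X]
        # Weak estimator (assumption): 1-nearest neighbour on the projected
        # training sample, standing in for the local polynomial estimator.
        risk = 0.0
        for i in learn:
            j = min(train, key=lambda k: abs(proj[k] - proj[i]))
            risk += (Y[i] - Y[j]) ** 2
        risks.append(risk / len(learn))
    best = min(risks)
    n = len(learn)
    scores = [math.exp(-temperature * n * (r - best)) for r in risks]
    total = sum(scores)
    weights = [s / total for s in scores]
    k = max(range(len(directions)), key=lambda i: weights[i])
    return weights, directions[k]

random.seed(0)
theta = (1 / math.sqrt(2), 1 / math.sqrt(2))          # true index (toy choice)
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(300)]
Y = [math.sin(3 * (theta[0] * x[0] + theta[1] * x[1])) for x in X]  # noiseless
lattice = [(math.cos(k * math.pi / 8), math.sin(k * math.pi / 8)) for k in range(8)]
weights, theta_hat = index_from_weights(X, Y, lattice, temperature=1.0)
print(theta_hat)
```

In this toy setting, the projection along a wrong direction mixes points with very different values of the link function, so the corresponding empirical risk is much larger and the weight essentially vanishes.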
Fig 9. Simulated datasets and aggregated estimators with cross-validated temperature for
f = hardsine, n = 100, and indexes $\vartheta = (1/\sqrt{2}, 1/\sqrt{2})$, $\vartheta = (2/\sqrt{14}, 1/\sqrt{14}, 3/\sqrt{14})$,
$\vartheta = (1/\sqrt{21}, -2/\sqrt{21}, 0, 4/\sqrt{21})$ from top to bottom. [Legend: training, learning.]




Fig 10. Simulated datasets and aggregated estimators with cross-validated temperature for
f = oscsine, n = 100, and indexes $\vartheta = (1/\sqrt{2}, 1/\sqrt{2})$, $\vartheta = (2/\sqrt{14}, 1/\sqrt{14}, 3/\sqrt{14})$,
$\vartheta = (1/\sqrt{21}, -2/\sqrt{21}, 0, 4/\sqrt{21})$ from top to bottom.

Fig 11. Simulated datasets and aggregated estimators with cross-validated temperature for
f = hardsine, n = 200, and indexes $\vartheta = (1/\sqrt{2}, 1/\sqrt{2})$, $\vartheta = (2/\sqrt{14}, 1/\sqrt{14}, 3/\sqrt{14})$,
$\vartheta = (1/\sqrt{21}, -2/\sqrt{21}, 0, 4/\sqrt{21})$ from top to bottom.
Fig 12. Simulated datasets and aggregated estimators with cross-validated temperature for
f = oscsine, n = 200, and indexes $\vartheta = (1/\sqrt{2}, 1/\sqrt{2})$, $\vartheta = (2/\sqrt{14}, 1/\sqrt{14}, 3/\sqrt{14})$,
$\vartheta = (1/\sqrt{21}, -2/\sqrt{21}, 0, 4/\sqrt{21})$ from top to bottom.
5. Proofs

Proof of Theorem 1

The functions $\bar g^{(\lambda)}$ are given by (2.8). They are computed based on the training
(or "frozen") sample $D_m$, which is independent of the learning sample $D_{(m)}$. If
$E^{(m)}$ denotes the integration with respect to the joint law of $D_{(m)}$, we obtain
using Theorem 4

$$E^{(m)}\|\hat g - g\|^2_{L^2(P_X)} \le (1+a)\min_{\lambda\in\Lambda}\|\bar g^{(\lambda)} - g\|^2_{L^2(P_X)} + \frac{C\log|\Lambda|\,(\log|D_{(m)}|)^{1/2}}{|D_{(m)}|}$$
$$\le (1+a)\|\bar g^{(\bar\lambda)} - g\|^2_{L^2(P_X)} + o(n^{-2s/(2s+1)}),$$

sinceIog|A|Qog|D Lro )|)Va/|D (wi) | < d(logn)_ 3 / 2 +7(2 Smin n) (sec (03J and (EH)), 
and where $\bar\lambda = (\bar\vartheta, \bar s) \in \Lambda$ is such that $\|\bar\vartheta - \vartheta\|_2 \le \Delta$ and $\lfloor\bar s\rfloor = \lfloor s\rfloor$ with
$s \in [\bar s, \bar s + (\log n)^{-1}]$. By integration with respect to $P^m$, we obtain

$$E^n\|\hat g - g\|^2_{L^2(P_X)} \le (1+a)E^m\|\bar g^{(\bar\lambda)} - g\|^2_{L^2(P_X)} + o(n^{-2s/(2s+1)}). \qquad (5.1)$$

The choice of $\bar\lambda$ entails $H^Q(s, L) \subset H^Q(\bar s, L)$ and

$$n^{-2\bar s/(2\bar s+1)} \le e^{1/2}\,n^{-2s/(2s+1)}.$$

Thus, together with (3.1) and (5.1), the Theorem follows if we prove that

$$\sup_{f\in H^Q(\bar s, L)} E^m\|\bar g^{(\bar\lambda)} - g\|^2_{L^2(P_X)} \le Cm^{-2\bar s/(2\bar s+1)} \qquad (5.2)$$

for n large enough, where C > 0. We cannot use directly Theorem 3 to prove
this, since the weak estimator $\bar g^{(\bar\lambda)}$ works based on data $D_m(\bar\vartheta)$ (see (2.1)) while
the true index is $\vartheta$. In order to clarify the proof, we write $\bar g^{(\bar\vartheta)}$ instead of $\bar g^{(\bar\lambda)}$
since in (5.2), the estimator uses the "correct" smoothness parameter $\bar s$. We
have

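$$\|\bar g^{(\bar\vartheta)} - g\|^2_{L^2(P_X)} \le 2\big(\|\bar g^{(\bar\vartheta)}(\cdot) - f(\bar\vartheta^\top\cdot)\|^2_{L^2(P_X)} + \|f(\bar\vartheta^\top\cdot) - f(\vartheta^\top\cdot)\|^2_{L^2(P_X)}\big)$$

and using together (2.14) and $f \in H^Q(s, L)$ for $s \ge s_{\min}$, we obtain

$$\|f(\bar\vartheta^\top\cdot) - f(\vartheta^\top\cdot)\|^2_{L^2(P_X)} \le L^2\int\|x\|_2^{2s}\,P_X(dx)\,\Delta^{2s} \le C(n\log n)^{-1}.$$

Let us denote by $Q_\vartheta(\cdot|X_1^m)$ the joint law of $(X_i, Y_i)_{1\le i\le m}$ from model (1.1)
(when the index is $\vartheta$) conditional on the $(X_i)_{1\le i\le m}$, which is given by

$$Q_\vartheta(dy_1^m|x_1^m) := \prod_{i=1}^m \frac{1}{\sigma(x_i)(2\pi)^{1/2}}\exp\Big(-\frac{(y_i - f(\vartheta^\top x_i))^2}{2\sigma(x_i)^2}\Big)\,dy_i.$$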
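Under $Q_{\bar\vartheta}(\cdot|X_1^m)$, we have

$$L_X(\vartheta,\bar\vartheta) := \frac{dQ_\vartheta(\cdot|X_1^m)}{dQ_{\bar\vartheta}(\cdot|X_1^m)} \overset{(law)}{=} \exp\Big(-\sum_{i=1}^m\frac{\epsilon_i\,(f(\bar\vartheta^\top X_i) - f(\vartheta^\top X_i))}{\sigma(X_i)} - \frac12\sum_{i=1}^m\frac{(f(\bar\vartheta^\top X_i) - f(\vartheta^\top X_i))^2}{\sigma(X_i)^2}\Big).$$

Hence, if $P_X^m$ denotes the joint law of $(X_1, \dots, X_m)$,

$$E^m\|\bar g^{(\bar\vartheta)}(\cdot) - f(\bar\vartheta^\top\cdot)\|^2_{L^2(P_X)} = \int\!\!\int\|\bar g^{(\bar\vartheta)}(\cdot) - f(\bar\vartheta^\top\cdot)\|^2_{L^2(P_X)}\,L_X(\vartheta,\bar\vartheta)\,dQ_{\bar\vartheta}(\cdot|X_1^m)\,dP_X^m$$
$$\le C\int\!\!\int\|\bar g^{(\bar\vartheta)}(\cdot) - f(\bar\vartheta^\top\cdot)\|^2_{L^2(P_X)}\,dQ_{\bar\vartheta}(\cdot|X_1^m)\,dP_X^m \qquad (5.3)$$
$$\quad + 4Q^2\int\!\!\int L_X(\vartheta,\bar\vartheta)\,\mathbf 1_{\{L_X(\vartheta,\bar\vartheta)\ge C\}}\,dQ_{\bar\vartheta}(\cdot|X_1^m)\,dP_X^m,$$

where we decomposed the integrand over $\{L_X(\vartheta,\bar\vartheta) \ge C\}$ and $\{L_X(\vartheta,\bar\vartheta) \le C\}$
for some constant $C \ge 3$, and where we used the fact that $\|\bar g^{(\bar\vartheta)}\|_\infty, \|f\|_\infty \le Q$.
Under $Q_{\bar\vartheta}(\cdot|X_1^m)$, the $(X_i, Y_i)$ have the same law as (X, Y) from model (1.1)
where the index is $\bar\vartheta$. Moreover, we assumed that $P_{\bar\vartheta^\top X}$ satisfies Assumption (D).
Hence, Theorem 3 entails that, uniformly for $f \in H^Q(\bar s, L)$,

$$\int\!\!\int\|\bar g^{(\bar\vartheta)}(\cdot) - f(\bar\vartheta^\top\cdot)\|^2_{L^2(P_X)}\,dQ_{\bar\vartheta}(\cdot|X_1^m)\,dP_X^m \le C'm^{-2\bar s/(2\bar s+1)}.$$

Moreover, the second term in the right hand side of (5.3) is smaller than

$$4Q^2\int\Big(\int L_X(\vartheta,\bar\vartheta)^2\,dQ_{\bar\vartheta}(\cdot|X_1^m)\Big)^{1/2}\,Q_{\bar\vartheta}\big[L_X(\vartheta,\bar\vartheta) \ge C\,|\,X_1^m\big]^{1/2}\,dP_X^m.$$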

Since $f \in H^Q(s, L)$ for $s \ge s_{\min}$, since $P_X$ is compactly supported and since
$\sigma(X) \ge \sigma_0$ a.s., we obtain using (2.14):



Lx(^, l ?) 2 ^(-|X 1 " i )<exp(-^ 



1 ^ (/(tf T X,) - /0? T JQ)) : 



2 



< 1 



i=l 

$P_X^m$-a.s. when m is large enough. Moreover, with the same arguments we have

$$Q_{\bar\vartheta}\big[L_X(\vartheta,\bar\vartheta) \ge C\,|\,X_1^m\big] \le m^{-(\log C)^2/2} \le m^{-4\bar s/(2\bar s+1)}$$

for C large enough, where we use the standard Gaussian deviation
$P[N(0, b^2) \ge a] \le \exp(-a^2/(2b^2))$. This concludes the proof of Theorem 1. □
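
The Gaussian deviation bound used in the last step can be checked numerically. The sketch below compares the exact upper tail, computed with the complementary error function from the standard library, with the bound $\exp(-a^2/(2b^2))$; the function names are hypothetical and only serve this illustration.

```python
import math

def gaussian_upper_tail(a, b):
    """Exact P[N(0, b^2) >= a], via the complementary error function."""
    return 0.5 * math.erfc(a / (b * math.sqrt(2.0)))

def deviation_bound(a, b):
    """The bound exp(-a^2 / (2 b^2)) used in the proof."""
    return math.exp(-a * a / (2.0 * b * b))

for a in (0.5, 1.0, 2.0, 4.0):
    print(a, gaussian_upper_tail(a, 1.0), deviation_bound(a, 1.0))
```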
Proof of Theorem 2

We want to bound the minimax risk

$$\inf_{\tilde g}\ \sup_{f\in H^Q(s,L)} E^n\int\big(\tilde g(x) - f(\vartheta^\top x)\big)^2\,P_X(dx) \qquad (5.4)$$

from below, where the infimum is taken among all estimators $\mathbb R^d \to \mathbb R$ based on
data from model (1.1), (1.2). We recall that $\vartheta^\top X$ satisfies Assumption (D). We
consider $\vartheta^{(2)}, \dots, \vartheta^{(d)}$ in $\mathbb R^d$ such that $(\vartheta, \vartheta^{(2)}, \dots, \vartheta^{(d)})$ is an orthogonal basis
of $\mathbb R^d$. We denote by O the matrix with columns $\vartheta, \vartheta^{(2)}, \dots, \vartheta^{(d)}$. We define
$Y := OX = (Y^{(1)}, \dots, Y^{(d)})$ and $Y_2^d := (Y^{(2)}, \dots, Y^{(d)})$. By the change of
variable y = Ox, we obtain

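$$\int_{\mathbb R^d}\big(\tilde g(x) - f(\vartheta^\top x)\big)^2\,P_X(dx) = \int_{\mathbb R^d}\big(\tilde g(O^{-1}y) - f(y^{(1)})\big)^2\,P_Y(dy)$$
$$= \int_{\mathbb R}\int_{\mathbb R^{d-1}}\big(\tilde g(O^{-1}y) - f(y^{(1)})\big)^2\,P_{Y_2^d|Y^{(1)}}(dy_2^d|y^{(1)})\,P_{Y^{(1)}}(dy^{(1)})$$
$$\ge \int_{\mathbb R}\big(\tilde f(y^{(1)}) - f(y^{(1)})\big)^2\,P_{\vartheta^\top X}(dy^{(1)}),$$

where $\tilde f(y^{(1)}) := \int\tilde g(O^{-1}y)\,P_{Y_2^d|Y^{(1)}}(dy_2^d|y^{(1)})$. Hence, if $Z := \vartheta^\top X$, (5.4) is
larger than

$$\inf_{\tilde f}\ \sup_{f\in H^Q(s,L)} E^n\int\big(\tilde f(z) - f(z)\big)^2\,P_Z(dz), \qquad (5.5)$$

where the infimum is taken among all estimators $\mathbb R \to \mathbb R$ based on data from
model (1.1) with d = 1 (univariate regression). In order to bound (5.5) from
below, we use the following Theorem, from Tsybakov (2003a), which is a standard
tool for the proof of such a lower bound. We say that $d$ is a semi-distance
on some set $\Theta$ if it is symmetric, if it satisfies the triangle inequality and if
$d(\theta, \theta) = 0$ for any $\theta \in \Theta$. We consider $K(P|Q) := \int\log(\frac{dP}{dQ})\,dP$ the
Kullback-Leibler divergence between probability measures P and Q.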

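Theorem 5. Let $(\Theta, d)$ be a set endowed with a semi-distance $d$. We suppose
that $\{P_\theta;\ \theta \in \Theta\}$ is a family of probability measures on a measurable space $(\mathcal X, \mathcal A)$
and that $(v_n)_{n\in\mathbb N}$ is a sequence of positive numbers. If there exist $\{\theta_0, \dots, \theta_M\} \subset \Theta$,
with $M \ge 2$, such that

• $d(\theta_j, \theta_k) \ge 2v_n$ for all $0 \le j < k \le M$,
• $P_{\theta_j} \ll P_{\theta_0}$ for all $1 \le j \le M$,
• $\frac1M\sum_{j=1}^M K(P^n_{\theta_j}|P^n_{\theta_0}) \le \alpha\log M$ for some $\alpha \in (0, 1/8)$,

then

$$\inf_{\hat\theta_n}\ \sup_{\theta\in\Theta} E^n_\theta\big[(v_n^{-1}d(\hat\theta_n, \theta))^2\big] \ge \frac{\sqrt M}{1+\sqrt M}\Big(1 - 2\alpha - 2\sqrt{\frac{\alpha}{\log M}}\Big),$$

where the infimum is taken among all estimators based on a sample of size n.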
Let us define $m := \lfloor c_0n^{1/(2s+1)}\rfloor$, the largest integer smaller than $c_0n^{1/(2s+1)}$,
where $c_0 > 0$. Let $\varphi : \mathbb R \to [0, +\infty)$ be a function in $H^Q(s, 1/2; \mathbb R)$ with support
in [-1/2, 1/2]. We take $h_n := m^{-1}$ and $z_k := (k - 1/2)/m$ for $k \in \{1, \dots, m\}$.
For $\omega \in \Omega := \{0, 1\}^m$, we consider the functions

$$f(\cdot;\omega) := \sum_{k=1}^m\omega_k\varphi_k(\cdot) \quad\text{where}\quad \varphi_k(\cdot) := Lh_n^s\,\varphi\Big(\frac{\cdot - z_k}{h_n}\Big).$$

We have

$$\|f(\cdot;\omega) - f(\cdot;\omega')\|^2_{L^2(P_Z)} = \sum_{k=1}^m(\omega_k - \omega'_k)^2\int\varphi_k(z)^2\,P_Z(dz) \ge \mu_0\,\rho(\omega,\omega')\,L^2h_n^{2s+1}\int_{S_\mu}\varphi(u)^2\,du,$$

where $S_\mu := \operatorname{Supp}P_Z - \cup_z[a_z, b_z]$ (the union is over the z such that $\mu(z) = 0$,
see Assumption (D)), where $\mu_0 := \min_{z\in S_\mu}\mu(z) > 0$ and where

$$\rho(\omega,\omega') := \sum_{k=1}^m\mathbf 1_{\omega_k\ne\omega'_k}$$

is the Hamming distance on $\Omega$. Using a result of Varshamov-Gilbert (see Tsybakov
(2003a)) we can find a subset $\{\omega^{(0)}, \dots, \omega^{(M)}\}$ of $\Omega$ such that $\omega^{(0)} = (0, \dots, 0)$,
$\rho(\omega^{(j)}, \omega^{(k)}) \ge m/8$ for any $0 \le j < k \le M$ and $M \ge 2^{m/8}$. Hence, we have

$$\|f(\cdot;\omega^{(j)}) - f(\cdot;\omega^{(k)})\|_{L^2(P_Z)} \ge Dn^{-s/(2s+1)},$$

where $D := \big(\mu_0L^2\int_{S_\mu}\varphi(u)^2\,du/(8c_0^{2s})\big)^{1/2} \ge 2$ for $c_0$ small enough. Moreover,

M M 

m E ^; ( ,^ )) |p; ( ,^ )) ) < ^ £ ll/( -;- (0) ) - /( l HPz) 
fc=i ° fe=i 



'0 

where a := ( J L 2 ||^||^)/(cr 2 c^ +1 log 2) e (0,1/8) for c small enough. The con- 
clusion follows from Theorem [5] □ 
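
For small m, a family of binary words with the separation used above can be produced by a simple greedy search. The sketch below is an illustration only: it is not the Varshamov-Gilbert argument itself, the function names are hypothetical, and bit strings of $\{0,1\}^m$ are encoded as Python integers. It collects $M + 1$ words, with $M = 2^{m/8}$, that are pairwise at Hamming distance at least m/8 and contain the all-zero word.

```python
def hamming(u, v):
    """Hamming distance between two bit strings encoded as integers."""
    return bin(u ^ v).count("1")

def greedy_packing(m, target):
    """Greedily pick words of {0,1}^m pairwise at Hamming distance >= m/8."""
    code = [0]  # the all-zero word, playing the role of omega^(0)
    for cand in range(1, 2 ** m):
        if all(8 * hamming(cand, c) >= m for c in code):
            code.append(cand)
            if len(code) == target:
                break
    return code

m = 24
M = 2 ** (m // 8)            # the cardinality bound M >= 2^(m/8)
code = greedy_packing(m, M + 1)
print(len(code), [format(c, "024b") for c in code[:3]])
```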



Proof of Theorem 3

We recall that $r = \lfloor s\rfloor$ is the largest integer smaller than s, and that $\lambda(M)$
stands for the smallest eigenvalue of a matrix M.

Proof of (3.2)

First, we prove a bias-variance decomposition of the LPE at a fixed point
$z \in \operatorname{Supp}P_Z$. This kind of result is commonplace, see for instance Fan and Gijbels
(1995, 1996). We introduce the following weighted pseudo-inner product, for
fixed $z \in \mathbb R$ and h > 0:

$$\langle f, g\rangle_h := \frac{1}{m\bar P_Z[I(z,h)]}\sum_{i=1}^m f(Z_i)g(Z_i)\mathbf 1_{Z_i\in I(z,h)},$$

where we recall that $I(z, h) = [z - h, z + h]$, and that $\bar P_Z$ is given by (2.5). We
consider the associated pseudo-norm $\|g\|_h^2 := \langle g, g\rangle_h$. We introduce the power
functions $\varphi_a(\cdot) := ((\cdot - z)/h)^a$ for $a \in \{0, \dots, r\}$, which satisfy $\|\varphi_a\|_h \le 1$.

Note that the entries of the matrix $\bar Z_m = \bar Z_m(z, h)$ (see (2.4)) satisfy
$(\bar Z_m(z, h))_{a,b} := \langle\varphi_a, \varphi_b\rangle_h$ for $(a, b) \in \{0, \dots, r\}^2$. Hence, (2.3) is equivalent to
finding $\bar P \in \mathrm{Pol}_r$ such that

$$\langle\bar P, \varphi_a\rangle_h = \langle Y, \varphi_a\rangle_h \qquad (5.6)$$

for any $a \in \{0, \dots, r\}$, where $\langle Y, \varphi\rangle_h := (m\bar P_Z[I(z,h)])^{-1}\sum_{i=1}^m Y_i\varphi(Z_i)\mathbf 1_{Z_i\in I(z,h)}$.
In other words, $\bar P$ is the projection of Y onto $\mathrm{Pol}_r$ with respect to the inner
product $\langle\cdot,\cdot\rangle_h$. For $e_1 := (1, 0, \dots, 0) \in \mathbb R^{r+1}$, we have
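$$\bar f(z) - f(z) = e_1^\top\bar Z_m^{-1}\bar Z_m(\bar\theta - \theta)$$

whenever $\lambda(\bar Z_m) > 0$, where $\bar\theta$ is the coefficient vector of $\bar P$ and $\theta$ is the coefficient
vector of the Taylor polynomial P of f at z with degree r. In view of (5.6):

$$(\bar Z_m(\bar\theta - \theta))_a = \langle\bar P - P, \varphi_a\rangle_h = \langle Y - P, \varphi_a\rangle_h,$$

thus $\bar Z_m(\bar\theta - \theta) = B + V$ where $(B)_a := \langle f - P, \varphi_a\rangle_h$ and $(V)_a := \langle\sigma(\cdot)\epsilon, \varphi_a\rangle_h$.
The bias term satisfies $|e_1^\top\bar Z_m^{-1}B| \le (r+1)^{1/2}\|\bar Z_m^{-1}\|\,\|B\|_\infty$ where for any
$a \in \{0, \dots, r\}$

$$|(B)_a| \le \|f - P\|_h \le Lh^s/r!.$$

Let $\bar Z_m^\sigma$ be the matrix with entries $(\bar Z_m^\sigma)_{a,b} := \langle\sigma(\cdot)\varphi_a, \sigma(\cdot)\varphi_b\rangle_h$. Since V is,
conditionally on $Z_1^m = (Z_1, \dots, Z_m)$, centered Gaussian with covariance matrix
$(m\bar P_Z[I(z,h)])^{-1}\bar Z_m^\sigma$, we have that $e_1^\top\bar Z_m^{-1}V$ is centered Gaussian with variance
smaller than

$$(m\bar P_Z[I(z,h)])^{-1}\,e_1^\top\bar Z_m^{-1}\bar Z_m^\sigma\bar Z_m^{-1}e_1 \le \sigma_1^2\,(m\bar P_Z[I(z,h)])^{-1}\lambda(\bar Z_m)^{-1}$$

where we used $\sigma(\cdot) \le \sigma_1$. Hence, if $C_r := (r+1)^{1/2}/r!$, we obtain

$$E^m\big[(\bar f(z) - f(z))^2\,|\,Z_1^m\big] \le \lambda(\bar Z_m(z,h))^{-2}\big(C_rLh^s + \sigma_1(m\bar P_Z[I(z,h)])^{-1/2}\big)^2$$

for any z, and the bandwidth choice (2.6) entails (3.2).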
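
The projection characterisation (5.6) translates directly into a small linear system. The following sketch, in pure Python and with hypothetical function names, builds the Gram matrix of the power functions on the window $[z - h, z + h]$, solves the normal equations, and returns the first coefficient, which is the value of the fitted polynomial at z. It assumes that the window contains enough points for the Gram matrix to be invertible.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def local_polynomial(z, h, r, Z, Y):
    """Value at z of the degree-r local polynomial fit on [z - h, z + h]."""
    window = [(zi, yi) for zi, yi in zip(Z, Y) if abs(zi - z) <= h]
    k = len(window)
    phi = lambda a, zi: ((zi - z) / h) ** a          # power functions
    gram = [[sum(phi(a, zi) * phi(b, zi) for zi, _ in window) / k
             for b in range(r + 1)] for a in range(r + 1)]
    rhs = [sum(yi * phi(a, zi) for zi, yi in window) / k for a in range(r + 1)]
    theta = solve(gram, rhs)
    return theta[0]  # e_1^T theta

# Noiseless toy data from a quadratic: the degree-2 fit has no bias.
f = lambda t: 1 + 2 * t - 3 * t * t
Z = [i / 50 - 1 for i in range(101)]
Y = [f(t) for t in Z]
estimate = local_polynomial(0.3, 0.2, 2, Z, Y)
print(estimate, f(0.3))
```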
Proof of (3.3)

Let us consider the sequence of positive curves $h_m(\cdot)$ defined as the point-by-point
solution to

$$Lh_m(z)^s = \sigma_1\big(mP_Z[I(z, h_m(z))]\big)^{-1/2} \qquad (5.7)$$

for all $z \in \operatorname{Supp}P_Z$, where we recall $I(z, h) = [z - h, z + h]$, and let us define

$$r_m(z) := Lh_m(z)^s.$$

The sequence $h_m(\cdot)$ is the deterministic equivalent to the bandwidth $H_m(\cdot)$ given
by (2.6). Indeed, with a large probability, $H_m(\cdot)$ and $h_m(\cdot)$ are close to each other
in view of Lemma 1 below. Under Assumption (D) we have $P_Z[I] \ge \gamma|I|^{\beta+1}$,
which entails together with (5.7) that

$$h_m(z) \le D_1m^{-1/(1+2s+\beta)} \qquad (5.8)$$

uniformly for $z \in \operatorname{Supp}P_Z$, where $D_1 = (\sigma_1/L)^{2/(1+2s+\beta)}(\gamma2^{\beta+1})^{-1/(1+2s+\beta)}$.
Moreover, since $P_Z$ has a continuous density $\mu$ with respect to the Lebesgue
measure, we have

$$h_m(z) \ge D_2m^{-1/(1+2s)} \qquad (5.9)$$

uniformly for $z \in \operatorname{Supp}P_Z$, where $D_2 = (\sigma_1/L)^{2/(1+2s)}(2\mu_\infty)^{-1/(2s+1)}$. We recall
that $P_Z^m$ stands for the joint law of $(Z_1, \dots, Z_m)$.
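
The balance equation defining the deterministic bandwidth can be solved numerically by bisection, since its left side increases with h while its right side decreases. The sketch below is a minimal illustration under assumptions that are not in the text: the design is uniform on [-1, 1], the names `deterministic_bandwidth` and `uniform_prob` are hypothetical, and the parameter values are arbitrary. At z = 0 the equation reads $L^2h^{2s+1}m = \sigma^2$, so the solution is $(\sigma^2/(L^2m))^{1/(2s+1)}$.

```python
def deterministic_bandwidth(z, m, s, L, sigma, prob_interval):
    """Solve L * h^s = sigma * (m * P[z-h, z+h])^(-1/2) for h by bisection."""
    def gap(h):
        # Increasing in h: the left side grows while the right side shrinks.
        return L * h ** s - sigma / (m * prob_interval(z - h, z + h)) ** 0.5
    lo, hi = 1e-12, 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def uniform_prob(a, b):
    """P[a, b] for the uniform law on [-1, 1] (an assumed toy design)."""
    return max(0.0, min(b, 1.0) - max(a, -1.0)) / 2.0

m, s, L, sigma = 1000, 2.0, 1.0, 1.0
h = deterministic_bandwidth(0.0, m, s, L, sigma, uniform_prob)
closed_form = (sigma ** 2 / (L ** 2 * m)) ** (1.0 / (2 * s + 1))
print(h, closed_form)
```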

Lemma 1. If Z satisfies Assumption (D), we have for any $\epsilon \in (0, 1/2)$

$$P_Z^m\Big[\sup_{z\in\operatorname{Supp}(P_Z)}\Big|\frac{H_m(z)}{h_m(z)} - 1\Big| > \epsilon\Big] \le \exp(-D\epsilon^2m^\alpha)$$

for m large enough, where $\alpha := 2s/(1 + 2s + \beta)$ and D is a constant depending
on $\sigma_1$ and L.

The next lemma provides a uniform control on the smallest eigenvalue of
$\bar Z_m(z) := \bar Z_m(z, H_m(z))$ under Assumption (D).

Lemma 2. If Z satisfies Assumption (D), there exists $\lambda_0 > 0$ depending on $\beta$
and s only such that

$$P_Z^m\Big[\inf_{z\in\operatorname{Supp}P_Z}\lambda(\bar Z_m(z)) \le \lambda_0\Big] \le \exp(-Dm^\alpha),$$

for m large enough, where $\alpha = 2s/(1 + 2s + \beta)$, and D is a constant depending
on $\gamma, \beta, s, L, \sigma_1$.

The proofs of Lemmas 1 and 2 are given in Section 6. We consider the event

$$\Omega_m(\epsilon) := \Big\{\inf_{z\in\operatorname{Supp}P_Z}\lambda(\bar Z_m(z)) > \lambda_0\Big\} \cap \Big\{\sup_{z\in\operatorname{Supp}P_Z}|H_m(z)/h_m(z) - 1| \le \epsilon\Big\},$$

where $\epsilon \in (0, 1/2)$. We have for any $f \in H^Q(s, L)$

^[||r Q (/)-/||i 2( P z) lo„ l( ,)]<A - 2 (l + £ ) 2 ^ / ffi* '] 

J Jz~h m (z) p z{dt) 

where we used together the definition of fi m (e), (|3.2[) and (|5.7|) . Let us denote 
/ := SuppP^ and let I zr be the intervals from Assumption |(D)| Using together 
the fact that min^gi-u^j^ fJ-(z) > and (|5.9|) . we obtain 



P z (dz) 



.(*) 

Using the monotonicity constraints from Assumption (D), we obtain
P{dz) 



m 



hence $E^m[\|\tau_Q(\bar f) - f\|^2_{L^2(P_Z)}\mathbf 1_{\Omega_m(\epsilon)}] \le Cm^{-2s/(2s+1)}$ uniformly for $f \in H^Q(s, L)$.
Using together Lemmas 1 and 2 we obtain $E^m[\|\tau_Q(\bar f) - f\|^2_{L^2(P_Z)}\mathbf 1_{\Omega_m(\epsilon)^\complement}] = o(n^{-2s/(2s+1)})$,
and (3.3) follows. □

Proof of Theorem 4

In model (1.1), when the noise $\epsilon$ is centered and such that $E(\epsilon^2) = 1$, the risk
of a function $\bar g : \mathbb R^d \to \mathbb R$ is given by

$$A(\bar g) := E[(Y - \bar g(X))^2] = E[\sigma(X)^2] + \|\bar g - g\|^2_{L^2(P_X)},$$

where g is the regression function. Therefore, the excess risk satisfies

$$A(\bar g) - A = \|\bar g - g\|^2_{L^2(P_X)},$$

where $A := A(g) = E[\sigma(X)^2]$. Let us introduce n := |D| the size of the learning
sample, and $M := |\Lambda|$ the size of the dictionary of functions $\{\bar g^{(\lambda)};\ \lambda \in \Lambda\}$. The
least squares of $\bar g$ over the learning sample is given by

$$A_n(\bar g) := \frac1n\sum_{i=1}^n(Y_i - \bar g(X_i))^2.$$
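We begin with a linearization of these risks. We consider the convex set

$$\mathcal C := \Big\{(\theta_\lambda)_{\lambda\in\Lambda} \text{ such that } \theta_\lambda \ge 0 \text{ and } \sum_{\lambda\in\Lambda}\theta_\lambda = 1\Big\},$$

and define the linearized risks on $\mathcal C$ as

$$\tilde A(\theta) := \sum_{\lambda\in\Lambda}\theta_\lambda A(\bar g^{(\lambda)}), \qquad \tilde A_n(\theta) := \sum_{\lambda\in\Lambda}\theta_\lambda A_n(\bar g^{(\lambda)}),$$

which are linear versions of the risk A and its empirical version $A_n$. The
exponential weights $w = (w_\lambda)_{\lambda\in\Lambda} := (w(\bar g^{(\lambda)}))_{\lambda\in\Lambda}$ are actually the unique solution
of the minimization problem

$$\min\Big(\tilde A_n(\theta) + \frac{1}{Tn}\sum_{\lambda\in\Lambda}\theta_\lambda\log\theta_\lambda\ \Big|\ \theta \in \mathcal C\Big), \qquad (5.10)$$

where T > 0 is the temperature parameter in the weights (2.10), and where we
use the convention 0 log 0 = 0. Let $\hat\lambda \in \Lambda$ be such that $A_n(\bar g^{(\hat\lambda)}) = \min_{\lambda\in\Lambda}A_n(\bar g^{(\lambda)})$.
Since $\sum_{\lambda\in\Lambda}w_\lambda\log\big(\frac{w_\lambda}{1/M}\big) = K(w|u) \ge 0$ where K(w|u) denotes the Kullback-
Leibler divergence between the weights w and the uniform weights $u := (1/M)_{\lambda\in\Lambda}$,
we have together with (5.10):

$$\tilde A_n(w) \le \tilde A_n(w) + \frac{1}{Tn}K(w|u) = \tilde A_n(w) + \frac{1}{Tn}\sum_{\lambda\in\Lambda}w_\lambda\log w_\lambda + \frac{\log M}{Tn} \le \tilde A_n(e_{\hat\lambda}) + \frac{\log M}{Tn},$$

where $e_\lambda \in \mathcal C$ is the vector with 1 for the $\lambda$-th coordinate and 0 elsewhere. Let
a > 0 and $A_n := A_n(g)$. For any $\lambda \in \Lambda$, we have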

$$\tilde A(w) - A = (1+a)(\tilde A_n(w) - A_n) + \tilde A(w) - A - (1+a)(\tilde A_n(w) - A_n)$$
$$\le (1+a)(\tilde A_n(e_\lambda) - A_n) + (1+a)\frac{\log M}{Tn} + \tilde A(w) - A - (1+a)(\tilde A_n(w) - A_n).$$

Let us denote by $E_K$ the expectation with respect to $P_K$, the joint law of the
learning sample for a noise $\epsilon$ which is bounded almost surely by K > 0. We have

$$E_K[\tilde A(w) - A] \le (1+a)\min_{\lambda\in\Lambda}(\tilde A(e_\lambda) - A) + (1+a)\frac{\log M}{Tn} + E_K\big[\tilde A(w) - A - (1+a)(\tilde A_n(w) - A_n)\big].$$

Using the linearity of $\tilde A$ on $\mathcal C$, we obtain

$$\tilde A(w) - A - (1+a)(\tilde A_n(w) - A_n) \le \max_{\bar g\in G_\Lambda}\big(A(\bar g) - A - (1+a)(A_n(\bar g) - A_n)\big),$$

where $G_\Lambda := \{\bar g^{(\lambda)};\ \lambda \in \Lambda\}$. Then, using Bernstein inequality, we obtain for all
$\delta > 0$

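$$P_K\big[\tilde A(w) - A - (1+a)(\tilde A_n(w) - A_n) \ge \delta\big] \le \sum_{\bar g\in G_\Lambda}P_K\Big[A(\bar g) - A - (A_n(\bar g) - A_n) \ge \frac{\delta + a(A(\bar g) - A)}{1+a}\Big]$$
$$\le \sum_{\bar g\in G_\Lambda}\exp\Big(-\frac{n(\delta + a(A(\bar g) - A))^2(1+a)^{-1}}{8Q^2(1+a)(A(\bar g) - A) + 2(6Q^2 + 2\sigma K)(\delta + a(A(\bar g) - A))/3}\Big).$$

Moreover, we have for any $\delta > 0$ and $\bar g \in G_\Lambda$,

$$\frac{(\delta + a(A(\bar g) - A))^2(1+a)^{-1}}{8Q^2(1+a)(A(\bar g) - A) + 2(6Q^2 + 2\sigma K)(\delta + a(A(\bar g) - A))/3} \ge C(a, K)\,\delta,$$

where $C(a, K) := \big(8Q^2(1+a)^2/a + 4(6Q^2 + 2\sigma K)(1+a)/3\big)^{-1}$, thus

$$E_K\big[\tilde A(w) - A - (1+a)(\tilde A_n(w) - A_n)\big] \le 2u + M\,\frac{\exp(-nC(a, K)u)}{nC(a, K)}.$$

If we denote by $\gamma_A$ the unique solution of $\gamma = A\exp(-\gamma)$, where A > 0, we have
$(\log A)/2 \le \gamma_A \le \log A$. Thus, if we take $u = \gamma_M/(nC(a, K))$, we obtain

$$E_K\big[\tilde A(w) - A - (1+a)(\tilde A_n(w) - A_n)\big] \le \frac{3\log M}{C(a, K)\,n}.$$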

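By convexity of the risk, we have

$$\tilde A(w) - A \ge A(\hat g) - A,$$

thus

$$E_K\big[\|\hat g - g\|^2_{L^2(P_X)}\big] \le (1+a)\min_{\lambda\in\Lambda}\|\bar g^{(\lambda)} - g\|^2_{L^2(P_X)} + C_1\frac{\log M}{n},$$

where $C_1 := (1+a)(T^{-1} + 3C(a, K)^{-1})$. It remains to prove the result when
the noise is Gaussian. Let us denote $\epsilon^n_\infty := \max_{1\le i\le n}|\epsilon_i|$. For any K > 0, we
have

$$E\big[\|\hat g - g\|^2_{L^2(P_X)}\big] = E\big[\|\hat g - g\|^2_{L^2(P_X)}\mathbf 1_{\epsilon^n_\infty\le K}\big] + E\big[\|\hat g - g\|^2_{L^2(P_X)}\mathbf 1_{\epsilon^n_\infty>K}\big]$$
$$\le E_K\big[\|\hat g - g\|^2_{L^2(P_X)}\big] + 2Q^2P[\epsilon^n_\infty > K].$$

For $K = K_n := 2(2\log n)^{1/2}$, we obtain using standard results about the maximum
of Gaussian vectors that $P[\epsilon^n_\infty > K_n] \le P[\epsilon^n_\infty - E[\epsilon^n_\infty] > (2\log n)^{1/2}] \le 1/n$,
which concludes the proof of the Theorem. □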
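
Two facts used at the start of this proof can be checked numerically: the exponential weights minimise the entropy-penalised linearized empirical risk over the simplex, and the linearized empirical risk at these weights exceeds the smallest empirical risk by at most $\log M/(Tn)$. The sketch below is an illustration only; the function names and the risk values are hypothetical.

```python
import math
import random

def objective(theta, risks, n, T):
    """Linearized empirical risk plus entropy penalty, as in (5.10)."""
    entropy = sum(t * math.log(t) for t in theta if t > 0)  # 0 log 0 = 0
    return sum(t * r for t, r in zip(theta, risks)) + entropy / (T * n)

def gibbs(risks, n, T):
    """Exponential weights proportional to exp(-T * n * risk)."""
    best = min(risks)
    scores = [math.exp(-T * n * (r - best)) for r in risks]
    total = sum(scores)
    return [s / total for s in scores]

risks = [0.020, 0.013, 0.012, 0.031]   # hypothetical empirical risks
n, T = 200, 0.5
w = gibbs(risks, n, T)
linearized = sum(wi * ri for wi, ri in zip(w, risks))
print(linearized, min(risks) + math.log(len(risks)) / (T * n))
```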
6. Proof of the lemmas

Proof of Lemma 1

Using together (2.6) and (5.7), if $I_m^\epsilon(z) := [z - (1+\epsilon)h_m(z),\ z + (1+\epsilon)h_m(z)]$
and $I_m(z) := I_m^0(z)$, we obtain for any $\epsilon \in (0, 1/2)$:

$$\{H_m(z) \le (1+\epsilon)h_m(z)\} = \{(1+\epsilon)^{2s}\bar P_Z[I_m^\epsilon(z)] \ge P_Z[I_m(z)]\}$$
$$\supset \{(1+\epsilon)^{2s}\bar P_Z[I_m(z)] \ge P_Z[I_m(z)]\},$$

where we used the fact that $\epsilon \mapsto \bar P_Z[I_m^\epsilon(z)]$ is nondecreasing. Similarly, we have
on the other side

$$\{H_m(z) > (1-\epsilon)h_m(z)\} \supset \{(1-\epsilon)^{2s}\bar P_Z[I_m(z)] \le P_Z[I_m(z)]\}.$$
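Thus, if we consider the set of intervals

$$\mathcal I_m := \bigcup_{z\in\operatorname{Supp}P_Z}\{I_m(z)\},$$

we obtain

$$\Big\{\sup_{z\in\operatorname{Supp}P_Z}\Big|\frac{H_m(z)}{h_m(z)} - 1\Big| \ge \epsilon\Big\} \subset \Big\{\sup_{I\in\mathcal I_m}\Big|\frac{\bar P_Z[I]}{P_Z[I]} - 1\Big| \ge \epsilon/2\Big\}.$$

Using together (5.7) and (5.8), we obtain

$$P_Z[I_m(z)] = \sigma_1^2/(mL^2h_m(z)^{2s}) \ge Dm^{-(\beta+1)/(1+2s+\beta)} =: \alpha_m. \qquad (6.1)$$

Hence, if $\epsilon' := \epsilon(1 + \epsilon/2)/(\epsilon + 2)$, we have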

sup 



Pz[I] 



Pz[I] 



>^/2}c{ 



sup _ > e 



U <^ sup 



P Z [I]-P Z [I] 



> ea 



Then, Theorem 6 (see Appendix) and the fact that the shatter coefficient satisfies
$S(\mathcal I_m, m) \le m(m+1)/2$ entail the Lemma. □



Proof of Lemma 2

Let us denote $\bar Z_m(z) := \bar Z_m(z, H_m(z))$ where $\bar Z_m(z, h)$ is given by (2.4) and
where $H_m(z)$ is given by (2.6). Let us define the matrix $\tilde Z_m(z) := \tilde Z_m(z, h_m(z))$
where

$$(\tilde Z_m(z, h))_{a,b} := \frac{1}{mP_Z[I(z,h)]}\sum_{i=1}^m\Big(\frac{Z_i - z}{h}\Big)^{a+b}\mathbf 1_{Z_i\in I(z,h)}.$$
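Step 1. Let us define for $\epsilon \in (0, 1)$ the event

$$\Omega_1(\epsilon) := \Big\{\sup_{z\in\operatorname{Supp}P_Z}\Big|\frac{H_m(z)}{h_m(z)} - 1\Big| \le \epsilon\Big\} \cap \Big\{\sup_{z\in\operatorname{Supp}P_Z}\Big|\frac{\bar P_Z[I(z, H_m(z))]}{P_Z[I(z, h_m(z))]} - 1\Big| \le \epsilon\Big\}.$$

For a matrix A, we denote $\|A\|_\infty := \max_{a,b}|(A)_{a,b}|$. We can prove that on $\Omega_1(\epsilon)$,
we have

$$\|\bar Z_m(z) - \tilde Z_m(z)\|_\infty \le \epsilon.$$

Moreover, using Lemma 1 we have $P_Z^m[\Omega_1(\epsilon)^\complement] \le C\exp(-D\epsilon^2m^\alpha)$. Hence, on
$\Omega_1(\epsilon)$, we have for any $v \in \mathbb R^d$, $\|v\|_2 = 1$

$$v^\top\bar Z_m(z)v \ge v^\top\tilde Z_m(z)v - \epsilon$$

uniformly for $z \in \operatorname{Supp}P_Z$.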

Step 2. We define the deterministic matrix $Z(z) := Z(z, h_m(z))$ where

$$(Z(z, h))_{a,b} := \frac{1}{P_Z[I(z,h)]}\int_{I(z,h)}\Big(\frac{t - z}{h}\Big)^{a+b}P_Z(dt),$$

and

$$\lambda_0 := \liminf_m\ \inf_{z\in\operatorname{Supp}P_Z}\lambda\big(Z(z, h_m(z))\big).$$

We prove that $\lambda_0 > 0$. Two cases can occur: either $\mu(z) = 0$ or $\mu(z) > 0$.
We show that in both cases, the liminf is positive. If $\mu(z) > 0$, the entries
$(Z(z, h_m(z)))_{a,b}$ have limit $(1 + (-1)^{a+b})/(2(a + b + 1))$, which defines a positive
definite matrix. If $\mu(z) = 0$, we know that the density $\mu(\cdot)$ of $P_Z$ behaves as the
power function $|\cdot - z|^{\beta(z)}$ around z for $\beta(z) \in (0, \beta)$. In this case, $(Z(z, h_m(z)))_{a,b}$
has limit $(1 + (-1)^{a+b})(\beta(z) + 1)/[2(1 + a + b + \beta(z))]$, which also defines a
positive definite matrix.
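
The positive definiteness of the two limit matrices can be checked numerically for small degrees. The sketch below is an illustration with hypothetical function names: it builds the matrix with entries $(1 + (-1)^{a+b})(\beta + 1)/[2(1 + a + b + \beta)]$ (the case $\beta = 0$ gives the first limit) and attempts a Cholesky factorisation, which succeeds exactly when a symmetric matrix is positive definite.

```python
def limit_matrix(r, beta=0.0):
    """Entries (1 + (-1)^(a+b)) (beta + 1) / (2 (1 + a + b + beta))."""
    return [[(1 + (-1) ** (a + b)) * (beta + 1) / (2 * (1 + a + b + beta))
             for b in range(r + 1)] for a in range(r + 1)]

def is_positive_definite(A):
    """Attempt a Cholesky factorisation of the symmetric matrix A."""
    n = len(A)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return False
                chol[i][j] = s ** 0.5
            else:
                chol[i][j] = s / chol[j][j]
    return True

for r in range(4):
    print(r, is_positive_definite(limit_matrix(r)),
          is_positive_definite(limit_matrix(r, beta=1.5)))
```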
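Step 3. We prove that

$$P_Z^m\Big[\sup_{z\in\operatorname{Supp}P_Z}\|\tilde Z_m(z) - Z(z)\|_\infty > \epsilon\Big] \le \exp(-D\epsilon^2m^\alpha).$$

We consider the sets of nonnegative functions (we recall that $I(z, h) = [z - h, z + h]$)

$$F^{(even)} := \bigcup_{\substack{z\in\operatorname{Supp}P_Z \\ a \text{ even and } 0\le a\le 2r}}\Big\{\Big(\frac{\cdot - z}{h_m(z)}\Big)^a\mathbf 1_{I(z, h_m(z))}(\cdot)\Big\},$$

$$F_+^{(odd)} := \bigcup_{\substack{z\in\operatorname{Supp}P_Z \\ a \text{ odd and } 0\le a\le 2r}}\Big\{\Big(\frac{\cdot - z}{h_m(z)}\Big)^a\mathbf 1_{[z,\,z+h_m(z)]}(\cdot)\Big\},$$

$$F_-^{(odd)} := \bigcup_{\substack{z\in\operatorname{Supp}P_Z \\ a \text{ odd and } 0\le a\le 2r}}\Big\{\Big(\frac{z - \cdot}{h_m(z)}\Big)^a\mathbf 1_{[z-h_m(z),\,z)}(\cdot)\Big\}.$$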
Writing $I(z, h_m(z)) = [z - h_m(z), z) \cup [z, z + h_m(z)]$ when a + b is odd, and since

$$P_Z[I(z, h_m(z))] \ge Ef(Z_1)$$

for any $f \in F := F^{(even)} \cup F_+^{(odd)} \cup F_-^{(odd)}$, we obtain

\\Z m (z) - Z(,)|U < snp \^^ff- E ^\ . 

Hence, since $x \mapsto x/(x + \alpha)$ is increasing for any $\alpha > 0$, and since
$\alpha := Ef(Z_1) \ge Dm^{-(\beta+1)/(1+2s+\beta)} =: \alpha_m$ (see (6.1)), we obtain

{ sup ||Z m (z)-Z(z)|| 00 >e} 

zgSupp P z 

c r sup > e/2 i 

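Then, using Theorem 7 (note that any $f \in F$ is non-negative), we obtain

$$P_Z^m\Big[\sup_{z\in\operatorname{Supp}P_Z}\|\tilde Z_m(z) - Z(z)\|_\infty > \epsilon\Big] \le 4E\big[N_1(\alpha_m\epsilon/8, F, Z_1^m)\big]\exp\big(-D\epsilon^2m^{2s/(1+2s+\beta)}\big).$$

Together with the inequality

$$E\big[N_1(\alpha_m\epsilon/8, F, Z_1^m)\big] \le D(\alpha_m\epsilon)^{-1}m^{1/(2s+1)+(\beta-1)/(2s+\beta)}, \qquad (6.2)$$

(see the proof below), this entails the Lemma. □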

Proof of (6.2)

It suffices to prove the inequality for $F^{(even)}$ and a fixed $a \in \{0, \dots, 2r\}$,
since the proof is the same for $F_+^{(odd)}$ and $F_-^{(odd)}$. We denote
$f_z(\cdot) := ((\cdot - z)/h_m(z))^a\mathbf 1_{I(z, h_m(z))}(\cdot)$. We prove the following statement:

$$N(\epsilon, F, \|\cdot\|_\infty) \le D\epsilon^{-1}m^{1/(2s+1)+(\beta-1)/(2s+\beta)},$$

which is stronger than (6.2), where $\|\cdot\|_\infty$ is the uniform norm over the support
of $P_Z$. Let $z, z_1, z_2 \in \operatorname{Supp}P_Z$. We have



\fzA z ) ~ fz 2 i z )\ < max(a, I) 



Z — Z\ Z — Z 2 



l/lU/ 2 



hi h 2 

where hj := h m (zj) and Ij := [zj — hj, Zj + hj] for j = 1, 2. Hence, 

l^i - h 2 \ + \zi - z 2 \ 



\f*d*)-fM\ < 



mm(hi,h 2 ) 
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Using (|5.7|) together with a differentiation of z >— ► /i m (;z) 2s Pz[-f(z, /i m (z))], we 
obtain that 



\h m (zi) - h m (z 2 )\ 



< sup 

21 <Z<22 



/l m ("Z) (M^ - M Z )) ^ M( Z + h m {z))) 



{2saf)/(mL) + /i m (z) 2s + 1 (^(z - h m {z)) + n(z + h m {z))) 



zi - z 2 \ 



for any z\ < z 2 in Supp fj,. This entails together with Assumption (D) ([5 
and (IQl): 



l^m(-Zl) - /im(z 2 )| < 



2s(7i)( 2s+1 )/( 2s +' 3 + 1 ) Vcr 2 
for any zi < z 2 in Supp//. Hence, 

\f Zl (z)-fM\ <Dm^ + ^\z 1 -z 2 
which concludes the proof of (|6.2p . 



— ) |«1 1 



□ 



Appendix A: Some tools from empirical process theory

Let $\mathcal A$ be a set of Borelean subsets of $\mathbb R$. If $x_1^n := (x_1, \dots, x_n) \in \mathbb R^n$, we define

$$N(\mathcal A, x_1^n) := \big|\{\{x_1, \dots, x_n\} \cap A\ |\ A \in \mathcal A\}\big|$$

and we define the shatter coefficient

$$S(\mathcal A, n) := \max_{x_1^n\in\mathbb R^n}N(\mathcal A, (x_1, \dots, x_n)). \qquad (A.1)$$

For instance, if $\mathcal A$ is the set of all the intervals [a, b] with $-\infty \le a < b \le +\infty$,
we have $S(\mathcal A, n) = n(n+1)/2$.
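
The count for intervals can be reproduced by enumeration. The sketch below, with a hypothetical function name, lists the distinct non-empty subsets of n distinct points that an interval can pick out: each one is a run of consecutive points in sorted order, which gives n(n+1)/2 traces. The empty intersection is not counted here; including it would add one.

```python
def interval_traces(points):
    """Distinct non-empty subsets of `points` of the form points ∩ [a, b]."""
    pts = sorted(points)
    traces = set()
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            # Any interval containing exactly pts[i..j] yields this trace,
            # for example one slightly wider than [pts[i], pts[j]].
            traces.add(tuple(pts[i:j + 1]))
    return traces

for n in range(1, 7):
    pts = [0.5 * k * k for k in range(n)]   # n distinct points
    print(n, len(interval_traces(pts)), n * (n + 1) // 2)
```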

Let $X_1, \dots, X_n$ be i.i.d. random variables with values in $\mathbb R$, and let us define
$\mu[A] := P(X_1 \in A)$ and $\bar\mu_n[A] := n^{-1}\sum_{i=1}^n\mathbf 1_{X_i\in A}$. The following inequalities for
relative deviations are due to Vapnik and Chervonenkis (1974), see for instance
Vapnik (1998).



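Theorem 6 (Vapnik and Chervonenkis (1974)). We have

$$P\Big[\sup_{A\in\mathcal A}\frac{\mu(A) - \bar\mu_n(A)}{\sqrt{\mu(A)}} > \epsilon\Big] \le 4S(\mathcal A, 2n)\exp(-n\epsilon^2/4)$$

and

$$P\Big[\sup_{A\in\mathcal A}\frac{\bar\mu_n(A) - \mu(A)}{\sqrt{\bar\mu_n(A)}} > \epsilon\Big] \le 4S(\mathcal A, 2n)\exp(-n\epsilon^2/4)$$

where $S(\mathcal A, 2n)$ is the shatter coefficient of $\mathcal A$ defined by (A.1).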
Let $(\mathcal X, \tau)$ be a measured space and $\mathcal F$ be a class of functions $f : \mathcal X \to [-K, K]$.
Let us fix $p \ge 1$ and $z_1^n \in \mathcal X^n$. Define the semi-distance $d_p(f, g)$
between f and g by

$$d_p(f, g) := \Big(\frac1n\sum_{i=1}^n|f(z_i) - g(z_i)|^p\Big)^{1/p}$$

and denote by $B_p(f, \epsilon)$ the $d_p$-ball with center f and radius $\epsilon$. The $\epsilon$-covering
number of $\mathcal F$ w.r.t. $d_p$ is defined as

$$N_p(\epsilon, \mathcal F, z_1^n) := \min\big(N\ |\ \exists f_1, \dots, f_N \text{ s.t. } \mathcal F \subset \cup_{j=1}^N B_p(f_j, \epsilon)\big).$$

Theorem 7 (Haussler (1992)). If $\mathcal F$ consists of functions $f : \mathcal X \to [0, K]$, we have



P 



|£[/(*i)]-££?=i > 



< 4£[A/- p (ae/8, ^, XH] cxp (-^) ■ 